Home > Enterprise >  Efficient data manipulation with pandas based on 2 dataframes
Efficient data manipulation with pandas based on 2 dataframes

Time:09-24

Here's my code with 2 dataframes:

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([[1, 2, 3, 5, 2], [2, 2, 3, 5, 2], [3, 2, 3, 5, 2], [10, 2, 3, 5, 2]]),
                   columns=['ID', 'itemX_2', 'itemK_3', 'itemC_5', 'itemH_2'])
df2 = pd.DataFrame(np.array([[1,1,1, 2,2,2, 3,3,3, 10,10,10], [2,3,5, 2,3,5, 2,3,5, 2,3,5], [20,40,60, 80,100,200, 220,240,260, 500,505,520]]).T,
                   columns=['ID', 'Item_id', 'value_to_assign'])

Based on df2 I want to modify df1

Expected output:

df_expected_output = pd.DataFrame(np.array([[1, 20, 40, 60, 20], [2, 80, 100, 200, 80], [3, 220, 240, 260, 220], [10, 500, 505, 520, 500]]),
                   columns=['ID', 'itemX_2', 'itemK_3', 'itemC_5', 'itemH_2'])

I have done it with iterating over columns and some operations. Hovever in my example i got more columns and rows in dataframes, so its pretty slow. Someone know how to do it in fast efficient way? Thanks

CodePudding user response:

Here is one solution. pivot df2 to have a format similar to df1 and then replace column by column by matching on the number after the last '_'.

df2_pivot = df2.pivot(index='ID', columns='Item_id', values='value_to_assign').rename_axis(None, axis=1)

df3 = df1.set_index('ID')
for c in df3:
    df3[c] = df2_pivot[int(c.rsplit('_', 1)[-1])]

Or, using a dictionary comprehension for the second part:

df3 = pd.DataFrame({c: df2_pivot[int(c.rsplit('_', 1)[-1])]
                    for c in df1.columns[1:]},
                    index=df1['ID']).reset_index()

output:

>>> df3.reset_index()
   ID  itemX_2  itemK_3  itemC_5  itemH_2
0   1       20       40       60       20
1   2       80      100      200       80
2   3      220      240      260      220
3  10      500      505      520      500

CodePudding user response:

Another way would be:

  1. Stack the original df which is to be replaced.
  2. grab the index and split the second index to get values after _
  3. using pd.Index.map, map the values of these index from df2
  4. Create a dataframe keeping this mapped value as value and the stacked multiindex as index and then unstack them.

s = df1.set_index("ID").stack()
i = s.index.map(lambda x: (x[0],x[1].split("_")[1]))
v = i.map(df2.set_index(["ID",df2['Item_id'].map(str)])['value_to_assign'])
out = pd.DataFrame({"value":v},index=s.index)['value'].unstack().reset_index()

print(out)

   ID  itemX_2  itemK_3  itemC_5  itemH_2
0   1       20       40       60       20
1   2       80      100      200       80
2   3      220      240      260      220
3  10      500      505      520      500

CodePudding user response:

DataFrame.replace

We can use pivot to reshape the dataframe df2 so that we can easily use replace method to substitute the values in df1

df1.set_index('ID').T.replace(df2.pivot('Item_id', 'ID', 'value_to_assign')).T

    itemX_2  itemK_3  itemC_5  itemH_2
ID                                    
1        20       40       60       20
2        80      100      200       80
3       220      240      260      220
10      500      505      520      500

CodePudding user response:

You can iterate over the columns of df1 and perform a pd.merge :

for col in df1.columns:
    if col == 'ID': continue
    df_temp = pd.merge(df1.loc[:, ['ID', col]], df2, on = 'ID')
    df1[col] = df_temp[df_temp[col] == df_temp['Item_id']]['value_to_assign'].reset_index(drop=True)

output:

   ID  itemX_2  itemK_3  itemC_5  itemH_2
0   1       20       40       60       20
1   2       80      100      200       80
2   3      220      240      260      220
3  10      500      505      520      500
  • Related