Efficient data manipulation with pandas based on 2 dataframes-CodePudding

Here's my code with 2 dataframes:

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([[1, 2, 3, 5, 2], [2, 2, 3, 5, 2], [3, 2, 3, 5, 2], [10, 2, 3, 5, 2]]),
                   columns=['ID', 'itemX_2', 'itemK_3', 'itemC_5', 'itemH_2'])
df2 = pd.DataFrame(np.array([[1,1,1, 2,2,2, 3,3,3, 10,10,10], [2,3,5, 2,3,5, 2,3,5, 2,3,5], [20,40,60, 80,100,200, 220,240,260, 500,505,520]]).T,
                   columns=['ID', 'Item_id', 'value_to_assign'])

Based on df2 I want to modify df1

Expected output:

df_expected_output = pd.DataFrame(np.array([[1, 20, 40, 60, 20], [2, 80, 100, 200, 80], [3, 220, 240, 260, 220], [10, 500, 505, 520, 500]]),
                   columns=['ID', 'itemX_2', 'itemK_3', 'itemC_5', 'itemH_2'])

I have done it with iterating over columns and some operations. Hovever in my example i got more columns and rows in dataframes, so its pretty slow. Someone know how to do it in fast efficient way? Thanks

CodePudding user response：

Here is one solution. pivot df2 to have a format similar to df1 and then replace column by column by matching on the number after the last '_'.

df2_pivot = df2.pivot(index='ID', columns='Item_id', values='value_to_assign').rename_axis(None, axis=1)

df3 = df1.set_index('ID')
for c in df3:
    df3[c] = df2_pivot[int(c.rsplit('_', 1)[-1])]

Or, using a dictionary comprehension for the second part:

df3 = pd.DataFrame({c: df2_pivot[int(c.rsplit('_', 1)[-1])]
                    for c in df1.columns[1:]},
                    index=df1['ID']).reset_index()

output:

>>> df3.reset_index()
   ID  itemX_2  itemK_3  itemC_5  itemH_2
0   1       20       40       60       20
1   2       80      100      200       80
2   3      220      240      260      220
3  10      500      505      520      500

CodePudding user response：

Another way would be:

Stack the original df which is to be replaced.
grab the index and split the second index to get values after _
using pd.Index.map, map the values of these index from df2
Create a dataframe keeping this mapped value as value and the stacked multiindex as index and then unstack them.

s = df1.set_index("ID").stack()
i = s.index.map(lambda x: (x[0],x[1].split("_")[1]))
v = i.map(df2.set_index(["ID",df2['Item_id'].map(str)])['value_to_assign'])
out = pd.DataFrame({"value":v},index=s.index)['value'].unstack().reset_index()

print(out)

   ID  itemX_2  itemK_3  itemC_5  itemH_2
0   1       20       40       60       20
1   2       80      100      200       80
2   3      220      240      260      220
3  10      500      505      520      500

CodePudding user response：

`DataFrame.replace`

We can use pivot to reshape the dataframe df2 so that we can easily use replace method to substitute the values in df1

df1.set_index('ID').T.replace(df2.pivot('Item_id', 'ID', 'value_to_assign')).T

    itemX_2  itemK_3  itemC_5  itemH_2
ID                                    
1        20       40       60       20
2        80      100      200       80
3       220      240      260      220
10      500      505      520      500

CodePudding user response：

You can iterate over the columns of df1 and perform a pd.merge :

for col in df1.columns:
    if col == 'ID': continue
    df_temp = pd.merge(df1.loc[:, ['ID', col]], df2, on = 'ID')
    df1[col] = df_temp[df_temp[col] == df_temp['Item_id']]['value_to_assign'].reset_index(drop=True)

output:

   ID  itemX_2  itemK_3  itemC_5  itemH_2
0   1       20       40       60       20
1   2       80      100      200       80
2   3      220      240      260      220
3  10      500      505      520      500