I have a pandas dataframe, df1
. I have another pandas time frame, df2
, with fruits
columns that I would like to replace the elements inside the list that are found in the duplicates
column in df1
with value of the name
column in df1
.
df1
name duplicates
0 a.apple ['b.apple', 'c.apple']
1 t.orange ['arr.orange', 'pg.orange']
2 ts.grape ['a.grape' , 'test.grape']
3 u.berryCool ['X.berryCool', 'cool.berryCool']
df2
people fruits
0 jack ['b.apple', 'c.apple', 'pp.tomato', 'ao.banana' ]
1 mary ['arr.orange', 'b.apple', 'X.berryCool', 'op.mango']
2 andy ['cool.berryCool' , 'test.grape', 'yu.papaya']
3 lawrence ['jc.orange', 'c.apple']
Expected Output
people fruits
0 jack ['a.apple', 'a.apple', 'pp.tomato', 'ao.banana' ]
1 mary ['t.orange', 'a.apple', 'u.berryCool', 'op.mango']
2 andy ['u.berryCool' , 'ts.grape', 'yu.papaya']
3 lawrence ['t.orange' , 'a.apple']
How can I accomplish this efficiently? Any suggestion is appreciated.
CodePudding user response:
You can create a mapping dict (Series):
MAPPING = df1.explode('duplicates').set_index('duplicates')['name']
df2['fruits'] = (df2.explode('fruits')['fruits'].replace(MAPPING)
.groupby(level=0).agg(list))
print(df2)
# Output
people fruits
0 jack [a.apple, a.apple, pp.tomato, ao.banana]
1 mary [t.orange, a.apple, u.berryCool, op.mango]
2 andy [u.berryCool, ts.grape, yu.papaya]
3 lawrence [jc.orange, a.apple]
CodePudding user response:
Create dictionary by flatten values in list in column duplicates
first and then mapping values with dict.get
- if no match return same value:
d = {x: a for a, b in zip(df1['name'], df1['duplicates']) for x in b}
df2['fruits'] = [[d.get(y,y) for y in x] for x in df2['fruits']]
print (df2)
people fruits
0 jack [a.apple, a.apple, pp.tomato, ao.banana]
1 mary [t.orange, a.apple, u.berryCool, op.mango]
2 andy [u.berryCool, ts.grape, yu.papaya]
3 lawrence [jc.orange, a.apple]
Performance in 4k DataFrame: (depends of data, best test real data)
df2 = pd.concat([df2] * 1000, ignore_index=True)
In [135]: %%timeit
...: MAPPING = df1.explode('duplicates').set_index('duplicates')['name']
...: df2['fruits1'] = (df2.explode('fruits')['fruits'].replace(MAPPING).groupby(level=0).agg(list))
...:
128 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [136]: %%timeit
...: d = {x: a for a, b in zip(df1['name'], df1['duplicates']) for x in b}
...:
...: df2['fruits2'] = [[d.get(y,y) for y in x] for x in df2['fruits']]
...:
5.27 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)