Given 2 DataFrames :
DF1 (a Serie) :
Col_Name_1
[['A, B']
['B']
['A, C']
['B, C']]
DF2 :
Col_Name_X, Col_Name_Y
[['A', 'Paris']
['B', 'London']
['C', 'Mexico']]
In DF1, I need to replace all values A, B, C, using DF2 which is corresponding table. Critical point is that in some rows, there are multiple values to replace : ['A, B'] for instance in first row.
I tried 3 ways : 1- split Col_Name_1 to have each value in a separate column and loop with a merge fonction 2- create a dictionnary from DF2 and use replace function on DF1 3- create a dictionnary from DF2 and use map function on DF2
Issue I have is that my DataFrames have 1 000 000 rows. 1 & 2 are consumming too much memory, CPU and time. I cannot get a result. 3 is fast but does not match multiple values from DF1 : A, B return NaN
I assume that it must exist a simple way to do this... Any idea please ?
Thanks
CodePudding user response:
Use mapping by dictionary with splitted values in column in df1
with dict.get
, if no match return same value:
d = df2.set_index('Col_Name_X')['Col_Name_Y'].to_dict()
f = lambda x: ', '.join(d.get(y,y) for y in x.split(', '))
df1['Col_Name_1'] = df1['Col_Name_1'].apply(f)
print (df1)
Col_Name_1
0 Paris, London
1 London
2 Paris, Mexico
3 London, Mexico