I have two dataframes that I would like to merge based on if column value of words
from df1
contains column value of keywords
from df2
. I've been trying to use str.extract
. But no luck so far to get the expected result. Example below:
df1:
[{'id': 1, 'words': 'chellomedia', 'languages': nan},
{'id': 2, 'words': 'Moien Welt!', 'languages': 'Luxemburgish'},
{'id': 3, 'words': 'Ahoj světe!', 'languages': 'Czech'},
{'id': 4, 'words': 'hello world', 'languages': nan},
{'id': 5, 'words': '¡Hola Mundo!', 'languages': 'Spanish'},
{'id': 6, 'words': 'hello kitty', 'languages': 'English'},
{'id': 7, 'words': 'Ciao mondo!', 'languages': 'Italian'},
{'id': 8, 'words': 'hola world', 'languages': nan}]
df2:
[{'code': 1, 'keywords': 'Hello'},
{'code': 2, 'keywords': 'hola'},
{'code': 3, 'keywords': 'world'}]
My trial code:
df1['words'] = df1['words'].str.lower()
df2['keywords'] = df2['keywords'].str.lower()
pat = '|'.join([re.escape(x) for x in df2.keywords])
df1.insert(0, 'keywords', df1['words'].str.extract('(' pat ')', expand=False))
pd.merge(df1, df2, on='keywords', how='left')
Out:
keywords id words languages code
0 hello 1 chellomedia NaN 1.0
1 NaN 2 moien welt! Luxemburgish NaN
2 NaN 3 ahoj světe! Czech NaN
3 hello 4 hello world NaN 1.0
4 hola 5 ¡hola mundo! Spanish 2.0
5 hello 6 hello kitty English 1.0
6 NaN 7 ciao mondo! Italian NaN
7 hola 8 hola world NaN 2.0
But the desired one should be like this:
keywords id words languages code
0 hello 1 chellomedia NaN 1.0
1 NaN 2 moien welt! Luxemburgish NaN
2 NaN 3 ahoj světe! Czech NaN
3 hello 4 hello world NaN 1.0
4 world 4 hello world NaN 3.0 ---> should be generated in df
5 hola 5 ¡hola mundo! Spanish 2.0
6 hello 6 hello kitty English 1.0
7 NaN 7 ciao mondo! Italian NaN
8 hola 8 hola world NaN 2.0
9 world 8 hola world NaN 3.0 ---> should be generated in df
How could I generate the expected result? Thanks.
CodePudding user response:
Instead of extract
you have to use findall
and explode
, example:
df1.insert(0, 'keywords', df1['words'].str.findall('(' pat ')'))
print(pd.merge(df1.explode('keywords'), df2, on='keywords', how='left')
.sort_values('id').reset_index(drop=True))
Output:
keywords id words languages code
0 hello 1 chellomedia NaN 1.0
1 NaN 2 moien welt! Luxemburgish NaN
2 NaN 3 ahoj světe! Czech NaN
3 hello 4 hello world NaN 1.0
4 world 4 hello world NaN 3.0
5 hola 5 ¡hola mundo! Spanish 2.0
6 hello 6 hello kitty English 1.0
7 NaN 7 ciao mondo! Italian NaN
8 world 8 hola world NaN 3.0
9 hola 8 hola world NaN 2.0
Exactly the same as what you need :)