One to multiple merge two dataframes if one column string contained in another with Python-CodePudding

I have two dataframes that I would like to merge based on if column value of words from df1 contains column value of keywords from df2. I've been trying to use str.extract. But no luck so far to get the expected result. Example below:

df1:

[{'id': 1, 'words': 'chellomedia', 'languages': nan},
 {'id': 2, 'words': 'Moien Welt!', 'languages': 'Luxemburgish'},
 {'id': 3, 'words': 'Ahoj světe!', 'languages': 'Czech'},
 {'id': 4, 'words': 'hello world', 'languages': nan},
 {'id': 5, 'words': '¡Hola Mundo!', 'languages': 'Spanish'},
 {'id': 6, 'words': 'hello kitty', 'languages': 'English'},
 {'id': 7, 'words': 'Ciao mondo!', 'languages': 'Italian'},
 {'id': 8, 'words': 'hola world', 'languages': nan}]

df2:

[{'code': 1, 'keywords': 'Hello'},
 {'code': 2, 'keywords': 'hola'},
 {'code': 3, 'keywords': 'world'}]

My trial code:

df1['words'] = df1['words'].str.lower()
df2['keywords'] = df2['keywords'].str.lower()

pat = '|'.join([re.escape(x) for x in df2.keywords])
df1.insert(0, 'keywords', df1['words'].str.extract('('   pat   ')', expand=False))

pd.merge(df1, df2, on='keywords', how='left')

Out:

  keywords  id         words     languages  code
0    hello   1   chellomedia           NaN   1.0
1      NaN   2   moien welt!  Luxemburgish   NaN
2      NaN   3   ahoj světe!         Czech   NaN
3    hello   4   hello world           NaN   1.0
4     hola   5  ¡hola mundo!       Spanish   2.0
5    hello   6   hello kitty       English   1.0
6      NaN   7   ciao mondo!       Italian   NaN
7     hola   8    hola world           NaN   2.0

But the desired one should be like this:

  keywords  id         words     languages  code
0    hello   1   chellomedia           NaN   1.0
1      NaN   2   moien welt!  Luxemburgish   NaN
2      NaN   3   ahoj světe!         Czech   NaN
3    hello   4   hello world           NaN   1.0
4    world   4   hello world           NaN   3.0  ---> should be generated in df
5     hola   5  ¡hola mundo!       Spanish   2.0
6    hello   6   hello kitty       English   1.0
7      NaN   7   ciao mondo!       Italian   NaN
8     hola   8    hola world           NaN   2.0
9    world   8    hola world           NaN   3.0  ---> should be generated in df

How could I generate the expected result? Thanks.

CodePudding user response：

Instead of extract you have to use findall and explode, example:

df1.insert(0, 'keywords', df1['words'].str.findall('('   pat   ')'))
print(pd.merge(df1.explode('keywords'), df2, on='keywords', how='left')
        .sort_values('id').reset_index(drop=True))

Output:

  keywords  id         words     languages  code
0    hello   1   chellomedia           NaN   1.0
1      NaN   2   moien welt!  Luxemburgish   NaN
2      NaN   3   ahoj světe!         Czech   NaN
3    hello   4   hello world           NaN   1.0
4    world   4   hello world           NaN   3.0
5     hola   5  ¡hola mundo!       Spanish   2.0
6    hello   6   hello kitty       English   1.0
7      NaN   7   ciao mondo!       Italian   NaN
8    world   8    hola world           NaN   3.0
9     hola   8    hola world           NaN   2.0

Exactly the same as what you need :)