Comparing & matching comma separated strings from different dataframes Python pandas-CodePudding

I want to compare column values from two different data frames, if both values match, insert a new Column in the data frame and add a value in the new column.

DF One:

      col name    col item  
 0     jo          abc    
 1     bo          efg   
 2     mo          xyz

DF Two:

      col name   col item          col code
 0     jo         abc , xyz          123
 1     bo         efg , xyz , zyx    456
 2     mo         abc,  xyz          789

I'm trying to compare the col_name and col_item values in DF One to the values in DF Two. If the name and item from DF One match values from DF Two, pull the corresponding col code value from DF Two into DF One. Each code number is set to a name and item. The results should look like this

DF One: End Result

       col_name   col_item     new_col
   0    jo         abc           123
   1    bo         efg           789
   2    mo         xyz           456

Here is the code I've been trying. My logic is if name and item from DF Two match name and item from DF One, pull col code value from DF Two into DF One

if df_two[' col name '] == df_one['col name '] & df_two [' col item '] == df_one[' col item ']:
    df_one['new col'] = df_two['col code']

df_one.head(5).to_dict

{'Date': {0: '8/24/2021',
   1: '8/17/2021',
   2: '8/19/2021',
   3: '8/19/2021',
   4: '8/19/2021'},
  'ID Number': {0: 123213,
   1: 4564564',
   2: '789789',
   3: '735764',
   4: '1010101'},
  'col name': {0: 'mo',
   1: 'bo',
   2: 'jo,LLP',
   3: 'do, LLP',
   4: 'to, LLP'},
  'Amount': {0: 900.0, 1: 105.0, 2: 1.02, 3: 132.0, 4: 8.0},
  'Tax': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
  'col item': {0: ' abc - bc - zxy ',
   1: ' cba - abc – zx ',
   2: ' hij ',
   3: ' lmn - op – xyz',
   4: ' lmn - ac – mno'},
  'BBNumber': {0: '30', 1: '30', 2: '30', 3: '30', 4: '30'}}

df_two.head(5).to_dict

{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
  'col name': {0: 'mo',
   1: 'bo',
   2: 'jo,LLP',
   3: 'do, LLP',
   4: 'to, LLP'},
  'col code': {0: 123, 1: 456, 2: 789, 3: 987, 4: 654},
  'col item': {0: ' abc - bc - zxy, lmn - ac – mno, cba’,
   1: cba - abc – zx, lmn - op - xyz',
   2: 'hij , qrx',
   3: ' lmn - op – xyz, abc’,
   4: 'lmn - ac – mno’}}

CodePudding user response：

You can try to split the string in col_item in df_two by str.split() into lists, then use .explode() to create rows with each item in the list in one row. Then, we .merge() with df_one on the 2 columns. Finally, rename the column col_code to new_col by .rename(), as follows:

df2a = df_two.assign(col_item=df_two['col_item'].str.split(r'\s*,\s*')).explode('col_item')

(df_one.merge(df2a, on=['col_name', 'col_item'])
       .rename({'col_code': 'new_col'}, axis=1)
)

Result:

  col_name col_item  new_col
0       jo      abc      123
1       bo      efg      789
2       mo      xyz      456

Edit

Here is a version without using .assign():

df2a = df_two.copy()
df2a['col_item'] = df2a['col_item'].str.split(r'\s*,\s*')
df2a = df2a.explode('col_item')

(df_one.merge(df2a, on=['col_name', 'col_item'])
       .rename({'col_code': 'new_col'}, axis=1)
)

Edit 2

As revealed from your dataset dumps by to_dict(), your strings in column col item in source dataframes df_one and df_two actually contain some leading and/or trailing spaces. This is the reason why some strings from the 2 dataframes cannot be successfully matched.

E.g. in df_one, you have string ' abc - bc - zxy ' (with spaces at both ends), while from df_two, you have string ' abc - bc - zxy, lmn - ac – mno, cba' (the first part before comma has only one space before it but no space before the comma). Hence, we can't match these strings without stripping the extra white spaces at both ends.

Here's the revised codes added with codes to strip the extra white spaces at both ends for column col item for both dataframes:

df2a = df_two.copy()
df2a['col item'] = df2a['col item'].str.split(r'\s*,\s*')
df2a = df2a.explode('col item')
df2a['col item'] = df2a['col item'].str.strip()   # strip extra white spaces at both ends 

df1a = df_one.copy()
df1a['col item'] = df1a['col item'].str.strip()   # strip extra white spaces at both ends 

df_out = (df1a.merge(df2a, on=['col name', 'col item'])
              .rename({'col code': 'new_col'}, axis=1)
         )

Data Input

(With rectification of typos in your to_dict() data dumps)

d1 = {'Date': {0: '8/24/2021', 1: '8/17/2021', 2: '8/19/2021', 3: '8/19/2021', 4: '8/19/2021'},
  'ID Number': {0: '123213', 1: '4564564', 2: '789789', 3: '735764', 4: '1010101'},
  'col name': {0: 'mo', 1: 'bo', 2: 'jo,LLP', 3: 'do, LLP', 4: 'to, LLP'},
  'Amount': {0: 900.0, 1: 105.0, 2: 1.02, 3: 132.0, 4: 8.0},
  'Tax': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
  'col item': {0: ' abc - bc - zxy ', 
               1: ' cba - abc – zx ', 
               2: ' hij ', 
               3: ' lmn - op – xyz', 
               4: ' lmn - ac – mno'},
  'BBNumber': {0: '30', 1: '30', 2: '30', 3: '30', 4: '30'}}
df_one = pd.DataFrame(d1)

        Date ID Number col name  Amount  Tax          col item BBNumber
0  8/24/2021    123213       mo  900.00  0.0   abc - bc - zxy        30
1  8/17/2021   4564564       bo  105.00  0.0   cba - abc – zx        30
2  8/19/2021    789789   jo,LLP    1.02  0.0              hij        30
3  8/19/2021    735764  do, LLP  132.00  0.0    lmn - op – xyz       30
4  8/19/2021   1010101  to, LLP    8.00  0.0    lmn - ac – mno       30

d2 = {'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
  'col name': {0: 'mo', 1: 'bo', 2: 'jo,LLP', 3: 'do, LLP', 4: 'to, LLP'}, 
  'col code': {0: 123, 1: 456, 2: 789, 3: 987, 4: 654},
  'col item': {0: ' abc - bc - zxy, lmn - ac – mno, cba', 
               1: 'cba - abc – zx, lmn - op - xyz', 
               2: 'hij , qrx', 
               3: ' lmn - op – xyz, abc', 
               4: 'lmn - ac – mno'}}
df_two = pd.DataFrame(d2)

   Unnamed: 0 col name  col code                              col item
0           0       mo       123   abc - bc - zxy, lmn - ac – mno, cba
1           1       bo       456        cba - abc – zx, lmn - op - xyz
2           2   jo,LLP       789                             hij , qrx
3           3  do, LLP       987                   lmn - op – xyz, abc
4           4  to, LLP       654                        lmn - ac – mno

Note that there is already a column Unnamed: 0 in your source dataframe. It is NOT introduced by running the solution codes.

Output

print(df_out)


        Date ID Number col name  Amount  Tax        col item BBNumber  Unnamed: 0  new_col
0  8/24/2021    123213       mo  900.00  0.0  abc - bc - zxy       30           0      123
1  8/17/2021   4564564       bo  105.00  0.0  cba - abc – zx       30           1      456
2  8/19/2021    789789   jo,LLP    1.02  0.0             hij       30           2      789
3  8/19/2021    735764  do, LLP  132.00  0.0  lmn - op – xyz       30           3      987
4  8/19/2021   1010101  to, LLP    8.00  0.0  lmn - ac – mno       30           4      654

You can also extract only some of the result columns, for example:

df_out2 = df_out[['col name', 'col item', 'new_col']]

Result:

print(df_out2)

  col name        col item  new_col
0       mo  abc - bc - zxy      123
1       bo  cba - abc – zx      456
2   jo,LLP             hij      789
3  do, LLP  lmn - op – xyz      987
4  to, LLP  lmn - ac – mno      654

CodePudding user response：

If you want to add a new column to an existing dataframe you need to use assign, which will add a pandas Series as a new column. Also, the output of df_two [' col_item '] == df_one[' col_item ']

should be

0    False
1    False
2    False
Name: col_item, dtype: bool

The shapes of the columns you are comparing are the same.

CodePudding user response：

import pandas as pd

df1 = pd.DataFrame({'col_name': ["jo", "bo", "mo"], 'col_item': ['abc', 'efg', 'xyz']})
df2 = pd.DataFrame({'col_name': ["jo", "bo"], 'col_item': ["abc,xyz", "efg,xyz,zyx"], 'col_code': ["123", "456"]})

print(df1)
print(df2)

You should merge the two frames on the common column. See https://stackoverflow.com/a/53645883/496289.
```
  df3 = df1.merge(df2, on='col_name')
  print(df3)
```

Remove unwanted rows.

  def is_col1_in_col2(row):
      return row.col_item_x in row.col_item_y

  df4 = df3[df3.apply(is_col1_in_col2, axis=1)]
  print(df4)

Remove unwanted columns.

  df5 = df4.drop('col_item_y', 1).rename(columns={'col_item_x':'col_item'})
  print(df5)

output:

df1
   col_name col_item
0       jo      abc
1       bo      efg
2       mo      xyz

df2
   col_name     col_item col_code
0       jo      abc,xyz      123
1       bo  efg,xyz,zyx      456

df3
   col_name col_item_x   col_item_y col_code
0       jo        abc      abc,xyz      123
1       bo        efg  efg,xyz,zyx      456

df4
   col_name col_item_x   col_item_y col_code
0       jo        abc      abc,xyz      123
1       bo        efg  efg,xyz,zyx      456

df5
   col_name col_item col_code
0       jo      abc      123
1       bo      efg      456