I want to compare column values from two different data frames, if both values match, insert a new Column in the data frame and add a value in the new column.
DF One:
col name col item
0 jo abc
1 bo efg
2 mo xyz
DF Two:
col name col item col code
0 jo abc , xyz 123
1 bo efg , xyz , zyx 456
2 mo abc, xyz 789
I'm trying to compare the col_name
and col_item
values in DF One to the values in DF Two. If the name
and item
from DF One match values from DF Two, pull the corresponding col code
value from DF Two into DF One. Each code number is set to a name and item. The results should look like this
DF One: End Result
col_name col_item new_col
0 jo abc 123
1 bo efg 789
2 mo xyz 456
Here is the code I've been trying. My logic is if name and item from DF Two match name and item from DF One, pull col code
value from DF Two into DF One
if df_two[' col name '] == df_one['col name '] & df_two [' col item '] == df_one[' col item ']:
df_one['new col'] = df_two['col code']
df_one.head(5).to_dict
{'Date': {0: '8/24/2021',
1: '8/17/2021',
2: '8/19/2021',
3: '8/19/2021',
4: '8/19/2021'},
'ID Number': {0: 123213,
1: 4564564',
2: '789789',
3: '735764',
4: '1010101'},
'col name': {0: 'mo',
1: 'bo',
2: 'jo,LLP',
3: 'do, LLP',
4: 'to, LLP'},
'Amount': {0: 900.0, 1: 105.0, 2: 1.02, 3: 132.0, 4: 8.0},
'Tax': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'col item': {0: ' abc - bc - zxy ',
1: ' cba - abc – zx ',
2: ' hij ',
3: ' lmn - op – xyz',
4: ' lmn - ac – mno'},
'BBNumber': {0: '30', 1: '30', 2: '30', 3: '30', 4: '30'}}
df_two.head(5).to_dict
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'col name': {0: 'mo',
1: 'bo',
2: 'jo,LLP',
3: 'do, LLP',
4: 'to, LLP'},
'col code': {0: 123, 1: 456, 2: 789, 3: 987, 4: 654},
'col item': {0: ' abc - bc - zxy, lmn - ac – mno, cba’,
1: cba - abc – zx, lmn - op - xyz',
2: 'hij , qrx',
3: ' lmn - op – xyz, abc’,
4: 'lmn - ac – mno’}}
CodePudding user response:
You can try to split the string in col_item
in df_two
by str.split()
into lists, then use .explode()
to create rows with each item in the list in one row. Then, we .merge()
with df_one
on the 2 columns. Finally, rename the column col_code
to new_col
by .rename()
, as follows:
df2a = df_two.assign(col_item=df_two['col_item'].str.split(r'\s*,\s*')).explode('col_item')
(df_one.merge(df2a, on=['col_name', 'col_item'])
.rename({'col_code': 'new_col'}, axis=1)
)
Result:
col_name col_item new_col
0 jo abc 123
1 bo efg 789
2 mo xyz 456
Edit
Here is a version without using .assign()
:
df2a = df_two.copy()
df2a['col_item'] = df2a['col_item'].str.split(r'\s*,\s*')
df2a = df2a.explode('col_item')
(df_one.merge(df2a, on=['col_name', 'col_item'])
.rename({'col_code': 'new_col'}, axis=1)
)
Edit 2
As revealed from your dataset dumps by to_dict()
, your strings in column col item
in source dataframes df_one
and df_two
actually contain some leading and/or trailing spaces. This is the reason why some strings from the 2 dataframes cannot be successfully matched.
E.g. in df_one
, you have string ' abc - bc - zxy '
(with spaces at both ends), while from df_two
, you have string ' abc - bc - zxy, lmn - ac – mno, cba'
(the first part before comma has only one space before it but no space before the comma). Hence, we can't match these strings without stripping the extra white spaces at both ends.
Here's the revised codes added with codes to strip the extra white spaces at both ends for column col item
for both dataframes:
df2a = df_two.copy()
df2a['col item'] = df2a['col item'].str.split(r'\s*,\s*')
df2a = df2a.explode('col item')
df2a['col item'] = df2a['col item'].str.strip() # strip extra white spaces at both ends
df1a = df_one.copy()
df1a['col item'] = df1a['col item'].str.strip() # strip extra white spaces at both ends
df_out = (df1a.merge(df2a, on=['col name', 'col item'])
.rename({'col code': 'new_col'}, axis=1)
)
Data Input
(With rectification of typos in your to_dict()
data dumps)
d1 = {'Date': {0: '8/24/2021', 1: '8/17/2021', 2: '8/19/2021', 3: '8/19/2021', 4: '8/19/2021'},
'ID Number': {0: '123213', 1: '4564564', 2: '789789', 3: '735764', 4: '1010101'},
'col name': {0: 'mo', 1: 'bo', 2: 'jo,LLP', 3: 'do, LLP', 4: 'to, LLP'},
'Amount': {0: 900.0, 1: 105.0, 2: 1.02, 3: 132.0, 4: 8.0},
'Tax': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'col item': {0: ' abc - bc - zxy ',
1: ' cba - abc – zx ',
2: ' hij ',
3: ' lmn - op – xyz',
4: ' lmn - ac – mno'},
'BBNumber': {0: '30', 1: '30', 2: '30', 3: '30', 4: '30'}}
df_one = pd.DataFrame(d1)
Date ID Number col name Amount Tax col item BBNumber
0 8/24/2021 123213 mo 900.00 0.0 abc - bc - zxy 30
1 8/17/2021 4564564 bo 105.00 0.0 cba - abc – zx 30
2 8/19/2021 789789 jo,LLP 1.02 0.0 hij 30
3 8/19/2021 735764 do, LLP 132.00 0.0 lmn - op – xyz 30
4 8/19/2021 1010101 to, LLP 8.00 0.0 lmn - ac – mno 30
d2 = {'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'col name': {0: 'mo', 1: 'bo', 2: 'jo,LLP', 3: 'do, LLP', 4: 'to, LLP'},
'col code': {0: 123, 1: 456, 2: 789, 3: 987, 4: 654},
'col item': {0: ' abc - bc - zxy, lmn - ac – mno, cba',
1: 'cba - abc – zx, lmn - op - xyz',
2: 'hij , qrx',
3: ' lmn - op – xyz, abc',
4: 'lmn - ac – mno'}}
df_two = pd.DataFrame(d2)
Unnamed: 0 col name col code col item
0 0 mo 123 abc - bc - zxy, lmn - ac – mno, cba
1 1 bo 456 cba - abc – zx, lmn - op - xyz
2 2 jo,LLP 789 hij , qrx
3 3 do, LLP 987 lmn - op – xyz, abc
4 4 to, LLP 654 lmn - ac – mno
Note that there is already a column Unnamed: 0
in your source dataframe. It is NOT introduced by running the solution codes.
Output
print(df_out)
Date ID Number col name Amount Tax col item BBNumber Unnamed: 0 new_col
0 8/24/2021 123213 mo 900.00 0.0 abc - bc - zxy 30 0 123
1 8/17/2021 4564564 bo 105.00 0.0 cba - abc – zx 30 1 456
2 8/19/2021 789789 jo,LLP 1.02 0.0 hij 30 2 789
3 8/19/2021 735764 do, LLP 132.00 0.0 lmn - op – xyz 30 3 987
4 8/19/2021 1010101 to, LLP 8.00 0.0 lmn - ac – mno 30 4 654
You can also extract only some of the result columns, for example:
df_out2 = df_out[['col name', 'col item', 'new_col']]
Result:
print(df_out2)
col name col item new_col
0 mo abc - bc - zxy 123
1 bo cba - abc – zx 456
2 jo,LLP hij 789
3 do, LLP lmn - op – xyz 987
4 to, LLP lmn - ac – mno 654
CodePudding user response:
If you want to add a new column to an existing dataframe you need to use assign, which will add a pandas Series as a new column. Also, the output of
df_two [' col_item '] == df_one[' col_item ']
should be
0 False
1 False
2 False
Name: col_item, dtype: bool
The shapes of the columns you are comparing are the same.
CodePudding user response:
import pandas as pd
df1 = pd.DataFrame({'col_name': ["jo", "bo", "mo"], 'col_item': ['abc', 'efg', 'xyz']})
df2 = pd.DataFrame({'col_name': ["jo", "bo"], 'col_item': ["abc,xyz", "efg,xyz,zyx"], 'col_code': ["123", "456"]})
print(df1)
print(df2)
You should merge the two frames on the common column. See https://stackoverflow.com/a/53645883/496289.
df3 = df1.merge(df2, on='col_name') print(df3)
Remove unwanted rows.
def is_col1_in_col2(row): return row.col_item_x in row.col_item_y df4 = df3[df3.apply(is_col1_in_col2, axis=1)] print(df4)
Remove unwanted columns.
df5 = df4.drop('col_item_y', 1).rename(columns={'col_item_x':'col_item'}) print(df5)
output:
df1
col_name col_item
0 jo abc
1 bo efg
2 mo xyz
df2
col_name col_item col_code
0 jo abc,xyz 123
1 bo efg,xyz,zyx 456
df3
col_name col_item_x col_item_y col_code
0 jo abc abc,xyz 123
1 bo efg efg,xyz,zyx 456
df4
col_name col_item_x col_item_y col_code
0 jo abc abc,xyz 123
1 bo efg efg,xyz,zyx 456
df5
col_name col_item col_code
0 jo abc 123
1 bo efg 456