I have two dataframes:
df1 is a reference table with a list of individual codes and their corresponding values.
df2 is a excerpt from a larger dataset, wherein one of the columns will contain multiple examples of the codes. It will also contain other values I want to ignore e.g. blanks and 'Not Applicable'.
I need to split out each individual code from df2 and find the corresponding value from the reference table df1. I then want to return a column in df2 with the maximum value from the entire string of codes.
import pandas as pd
df1 = [['H302',18],
['H312',17],
['H315',16],
['H316',15],
['H319',14],
['H320',13],
['H332',12],
['H304',11]]
df1 = pd.DataFrame(df1, columns=['Code', 'Value'])
df2 = [['H302,H304'],
['H332,H319,H312,H320,H316,H315,H302,H304'],
['H315,H312,H316'],
['H320,H332,H316,H315,H304,H302,H312'],
['H315,H319,H312,H316,H332'],
['H312'],
['Not Applicable'],
['']]
df2 = pd.DataFrame(df2, columns=['Code'])
I had previously used the following:
df3 = []
for i in range(len(df2)):
df3.append(df2['Code'][i].split(","))
max_values = []
for i in range(len(df3)):
for j in range(len(df3[i])):
for index in range(len(df1)):
if df1['Code'][index] == df3[i][j]:
df3[i][j] = df1['Value'][index]
max_values.append(max(df3[i]))
df2["Max Value"] = max_values
However, the .append function is being removed and when used I get the following error "'>' not supported between instances of 'numpy.ndarray' and 'str'"
CodePudding user response:
Code
df2['max'] = (
df2['Code']
.str.split(',')
.explode()
.map(df1.set_index('Code')['Value'])
.groupby(level=0).max()
)
How it works?
- Split by delimiter
,
- Explode to convert lists to rows
- Use
map
to substitute values fromdf1
- Groupby on
level=0
to findmax
value per row group
Result
Code max
0 H302,H304 18.0
1 H332,H319,H312,H320,H316,H315,H302,H304 18.0
2 H315,H312,H316 17.0
3 H320,H332,H316,H315,H304,H302,H312 18.0
4 H315,H319,H312,H316,H332 17.0
5 H312 17.0
6 Not Applicable NaN
7 NaN
CodePudding user response:
One can use pandas.Series.apply
with a custom lambda function as follows
df2['Max_Value'] = df2['Code'].apply(lambda x: max([df1.loc[df1['Code'] == i, 'Value'].values[0] for i in x.split(',') if i != 'Not Applicable' and i != ''], default=0))
[Out]:
Code Max_Value
0 H302,H304 18
1 H332,H319,H312,H320,H316,H315,H302,H304 18
2 H315,H312,H316 17
3 H320,H332,H316,H315,H304,H302,H312 18
4 H315,H319,H312,H316,H332 17
5 H312 17
6 Not Applicable 0
7 0
Given the first note below, if one doesn't want to use .apply()
, one can use a list comprehension as follows
df2['Max_Value'] = [max([df1.loc[df1['Code'] == i, 'Value'].values[0] for i in x.split(',') if i != 'Not Applicable' and i != ''], default=0) for x in df2['Code']]
[Out]:
Code Max_Value
0 H302,H304 18
1 H332,H319,H312,H320,H316,H315,H302,H304 18
2 H315,H312,H316 17
3 H320,H332,H316,H315,H304,H302,H312 18
4 H315,H319,H312,H316,H332 17
5 H312 17
6 Not Applicable 0
7 0
Notes:
- There are strong opinions on using
.apply()
, so one might want to read this.
CodePudding user response:
As an alternative:
keys=df1['Code'].to_list()
df2["Code"] = df2["Code"].str.split(',')
def get_max(x):
max_=[]
for i in x:
if i in keys:
max_.append(df1.loc[df1.Code==i, 'Value'].values[0])
else:
pass
if len(max_)>0:
return max(max_)
else:
[]
df2['max_value']=df2['Code'].apply(lambda x:get_max(x))
print(df2)
'''
Code max_value
0 ['H302', 'H304'] 18.0
1 ['H332', 'H319', 'H312', 'H320', 'H316', 'H315', 'H302', 'H304'] 18.0
2 ['H315', 'H312', 'H316'] 17.0
3 ['H320', 'H332', 'H316', 'H315', 'H304', 'H302', 'H312'] 18.0
4 ['H315', 'H319', 'H312', 'H316', 'H332'] 17.0
5 ['H312'] 17.0
6 ['Not Applicable'] nan
7 [''] nan
'''