Home > Blockchain >  How to return the max value from another dataframe
How to return the max value from another dataframe

Time:11-01

I have two dataframes:

df1 is a reference table with a list of individual codes and their corresponding values.

df2 is a excerpt from a larger dataset, wherein one of the columns will contain multiple examples of the codes. It will also contain other values I want to ignore e.g. blanks and 'Not Applicable'.

I need to split out each individual code from df2 and find the corresponding value from the reference table df1. I then want to return a column in df2 with the maximum value from the entire string of codes.

import pandas as pd

df1 = [['H302',18],
       ['H312',17],
       ['H315',16],
       ['H316',15],
       ['H319',14],
       ['H320',13],
       ['H332',12],
       ['H304',11]]
df1 = pd.DataFrame(df1, columns=['Code', 'Value'])

df2 = [['H302,H304'],
       ['H332,H319,H312,H320,H316,H315,H302,H304'],
       ['H315,H312,H316'],
       ['H320,H332,H316,H315,H304,H302,H312'],
       ['H315,H319,H312,H316,H332'],
       ['H312'],
       ['Not Applicable'],
       ['']]
df2 = pd.DataFrame(df2, columns=['Code'])

I had previously used the following:


df3 = []
for i in range(len(df2)):
    df3.append(df2['Code'][i].split(","))

max_values = []
for i in range(len(df3)):
    for j in range(len(df3[i])):
        for index in range(len(df1)):
            if df1['Code'][index] == df3[i][j]:
                df3[i][j] = df1['Value'][index]
    max_values.append(max(df3[i]))            

df2["Max Value"] = max_values


However, the .append function is being removed and when used I get the following error "'>' not supported between instances of 'numpy.ndarray' and 'str'"

CodePudding user response:

Code

df2['max'] = (
    df2['Code']
    .str.split(',')
    .explode()
    .map(df1.set_index('Code')['Value'])
    .groupby(level=0).max()
)

How it works?

  • Split by delimiter ,
  • Explode to convert lists to rows
  • Use map to substitute values from df1
  • Groupby on level=0 to find max value per row group

Result

                                      Code   max
0                                H302,H304  18.0
1  H332,H319,H312,H320,H316,H315,H302,H304  18.0
2                           H315,H312,H316  17.0
3       H320,H332,H316,H315,H304,H302,H312  18.0
4                 H315,H319,H312,H316,H332  17.0
5                                     H312  17.0
6                           Not Applicable   NaN
7                                            NaN

CodePudding user response:

One can use pandas.Series.apply with a custom lambda function as follows

df2['Max_Value'] = df2['Code'].apply(lambda x: max([df1.loc[df1['Code'] == i, 'Value'].values[0] for i in x.split(',') if i != 'Not Applicable' and i != ''], default=0))

[Out]:

                                      Code  Max_Value
0                                H302,H304         18
1  H332,H319,H312,H320,H316,H315,H302,H304         18
2                           H315,H312,H316         17
3       H320,H332,H316,H315,H304,H302,H312         18
4                 H315,H319,H312,H316,H332         17
5                                     H312         17
6                           Not Applicable          0
7                                                   0

Given the first note below, if one doesn't want to use .apply(), one can use a list comprehension as follows

df2['Max_Value'] = [max([df1.loc[df1['Code'] == i, 'Value'].values[0] for i in x.split(',') if i != 'Not Applicable' and i != ''], default=0) for x in df2['Code']]

[Out]:

                                      Code  Max_Value
0                                H302,H304         18
1  H332,H319,H312,H320,H316,H315,H302,H304         18
2                           H315,H312,H316         17
3       H320,H332,H316,H315,H304,H302,H312         18
4                 H315,H319,H312,H316,H332         17
5                                     H312         17
6                           Not Applicable          0
7                                                   0

Notes:

CodePudding user response:

As an alternative:

keys=df1['Code'].to_list()
df2["Code"] = df2["Code"].str.split(',')

def get_max(x):
    max_=[]
    for i in x:
        if i in keys:
            max_.append(df1.loc[df1.Code==i, 'Value'].values[0])
        else:
            pass
    if len(max_)>0:
        return max(max_)
    else:
        []
            
df2['max_value']=df2['Code'].apply(lambda x:get_max(x))
print(df2)
'''
    Code                                                                max_value
0   ['H302', 'H304']                                                    18.0
1   ['H332', 'H319', 'H312', 'H320', 'H316', 'H315', 'H302', 'H304']    18.0
2   ['H315', 'H312', 'H316']                                            17.0
3   ['H320', 'H332', 'H316', 'H315', 'H304', 'H302', 'H312']            18.0
4   ['H315', 'H319', 'H312', 'H316', 'H332']                            17.0
5   ['H312']                                                            17.0
6   ['Not Applicable']                                                  nan
7   ['']                                                                nan

'''

  • Related