Home > Enterprise >  Pandas - Lookup value for each item in list
Pandas - Lookup value for each item in list

Time:03-02

I am relatively new to Python and Pandas. I have two dataframes, one contains a column of codes separated by a comma - the number of codes in each list can vary and can contain a string such as 'Not Applicable' or a blank. The other is a lookup table of the codes and a value. I want to lookup the value of each individual code in each list and calculate the maximum value within that list. For example ['H302','H304'] would be [18,11] and the maximum value of those two would be 18. I then want to return the maximum value of each list as a new column to df2. If it contains anything else, return blank.

This process was originally written in VBA, I solved the problem there by splitting each set of codes by delimiter to a new column, then dynamically running index/matches against each code to return the value. Then it would calculate the maximum value and delete out all the generated columns. I thought at the time it was a messy way to do it and I don't want to replicate this in the Python version.

I would post what I've tried by I can't figure out how I'd go about this - any help is appreciated!

import pandas as pd

df1 = [['H302',18],
       ['H312',17],
       ['H315',16],
       ['H316',15],
       ['H319',14],
       ['H320',13],
       ['H332',12],
       ['H304',11]]
df1 = pd.DataFrame(df1, columns=['Code', 'Value'])

df2 = [['H302,H304'],
       ['H332,H319,H312,H320,H316,H315,H302,H304'],
       ['H315,H312,H316'],
       ['H320,H332,H316,H315,H304,H302,H312'],
       ['H315,H319,H312,H316,H332'],
       ['H312'],
       ['Not Applicable'],
       ['']]
df2 = pd.DataFrame(df2, columns=['Code'])

CodePudding user response:

First, df2 seems to be defined wrongly (single quotes between comas are required). Also, don't generate a data frame of it since you need to be flexible to have any number of elements.

Second, you would need to define the codes as the index to look for elements in the data frame. So, you would define the data frame as:

df1 = pd.DataFrame(df1, columns=['Code', 'Value']).set_index('Code')

Third, you need to loop through the second list of lists and index the elements you want before calculating the maximum using .loc. Also, you need to filter out the codes that are not in the first data frame.

result = []
for codes in df2:
    c = [_ for _ in codes if _ in df1.index]
    result.append(df1.loc[c,'Value'].max())

CodePudding user response:

Try:

df2.join(df2['Code'].str.split(',')\
   .explode()\
   .map(df1.set_index('Code')['Value']).groupby(level=0).max().rename('Value'))

Output:

                                      Code  Value
0                                H302,H304   18.0
1  H332,H319,H312,H320,H316,H315,H302,H304   18.0
2                           H315,H312,H316   17.0
3       H320,H332,H316,H315,H304,H302,H312   18.0
4                 H315,H319,H312,H316,H332   17.0
5                                     H312   17.0
6                           Not Applicable    NaN
7                                             NaN
  • Related