Cross referencing dataframes to extract specific values in python-CodePudding

Please assist in creating a python function. I have two dataframes, DF1 and DF2. I'd like to add a column to DF1, DF1['Score'], which is based on values contained in DF1 that match the values in DF2.

DF1:

DF2:

import pandas as pd
DF1 = pd.DataFrame({
     'Age':[25, 54, 33],
     'Income' :[10203, 23822, 84823],
     'Contract Length':[18, 12, 36],
     #'Score':[]
          })

DF2 = pd.DataFrame({
     'variable':['Age', 'Age', 'Age', 'Age',
                 'Income', 'Income', 'Income', 'Income',
                 'Contract Length', 'Contract Length', 'Contract Length', 'Contract Length'],
     'LQ':[ 25, 32.25, 39.5, 46.75, 10203, 28858, 47513, 66168, 12, 18, 24, 30],
     'UQ':[ 32.25, 39.5, 46.75, 54, 28858, 47513, 66168, 84823, 18, 24, 30, 36],
     'Score':[5, 10, 15, 20, 10, 15, 20, 25, 15, 20, 25, 30]
          })

Take customer UID 1 in DF1, since he is 25, with an income of 10,203 and contract length of 18; based on DF2 I want to be able to add a score of 30 to DF1['Score'] for customer 1, calculated as 5 (for age 25 to 32.5) 10 (for income 10,2013 to 28,858) 15 (for contract length of 12 to 18).

Please assist in creating a python function to add the correct scores to DF1['Score'] for all customers.

CodePudding user response：

You can use pandas pandas.DataFrame.apply to iterate through the rows from first dataframe and get the matching condition rows from the second dataframe.

Creating data

dict1 = {'customer UID': {0: 1, 1: 2, 2: 3}, 'Age': {0: 25, 1: 54, 2: 33}, 'Income': {0: 10203, 1: 23822, 2: 84823}, 'Contract Length': {0: 18, 1: 12, 2: 36}, 'Score': {0: '', 1: '', 2: ''}}

dict2 = {'variable': {0: 'Age', 1: 'Age', 2: 'Age', 3: 'Age', 4: 'Income', 5: 'Income', 6: 'Income', 7: 'Income', 8: 'Contract Length', 9: 'Contract Length', 10: 'Contract Length', 11: 'Contract Length'}, 'LQ': {0: 25.0, 1: 32.25, 2: 39.5, 3: 46.75, 4: 10203.0, 5: 28858.0, 6: 17513.0, 7: 66168.0, 8: 12.0, 9: 18.0, 10: 24.0, 11: 30.0}, 'UQ': {0: 32.25, 1: 39.5, 2: 46.75, 3: 54.0, 4: 28858.0, 5: 47513.0, 6: 66168.0, 7: 84823.0, 8: 18.0, 9: 24.0, 10: 30.0, 11: 36.0}, 'Score': {0: 5, 1: 10, 2: 15, 3: 20, 4: 10, 5: 15, 6: 20, 7: 25, 8: 15, 9: 20, 10: 25, 11: 30}}

df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)

Generating output

def get_values(row):
    age_condition = (row.Age >= df2['LQ']) & (row.Age <= df2['UQ']) & (df2.variable == 'Age')
    income_condition = (row.Income >= df2['LQ']) & (row.Income <= df2['UQ']) & (df2.variable == 'Income')
    contract_condition = (row['Contract Length'] >= df2['LQ']) & (row['Contract Length'] <= df2['UQ']) & (df2.variable == 'Contract Length')
    return df2[age_condition].Score.values[0]   df2[income_condition].Score.values[0]   df2[contract_condition].Score.values[0]

df1['Score'] = df1.apply(get_values, axis=1)

Output :

This gives us :

df1
   customer UID  Age  Income  Contract Length  Score
0             1   25   10203               18     30
1             2   54   23822               12     45
2             3   33   84823               36     65

CodePudding user response：

For efficiency, you need to use a merge_asof after performing a melt on DF1:

DF1['Score'] = (pd
 .merge_asof(DF1.astype(float).reset_index().melt('index').sort_values(by='value'),
             DF2.sort_values(by='UQ'),
             by='variable', left_on='value', right_on='UQ', direction='forward'
             )
 .groupby('index')['Score'].sum()
)

output:

   Age  Income  Contract Length  Score
0   25   10203               18   30.0
1   54   23822               12   45.0
2   33   84823               36   65.0

intermediates:

# reshape DF1 to long form
DF1.astype(float).reset_index().melt('index').sort_values(by='value')

   index         variable    value
7      1  Contract Length     12.0
6      0  Contract Length     18.0
0      0              Age     25.0
2      2              Age     33.0
8      2  Contract Length     36.0
1      1              Age     54.0
3      0           Income  10203.0
4      1           Income  23822.0
5      2           Income  84823.0

# merge asof with DF2 (i.e. find the closest UQ value greater that the target

(pd
 .merge_asof(DF1.astype(float).reset_index().melt('index').sort_values(by='value'),
             DF2.sort_values(by='UQ'),
             by='variable', left_on='value', right_on='UQ', direction='forward'
             )
)

   index         variable    value        LQ        UQ  Score
0      1  Contract Length     12.0     12.00     18.00     15
1      0  Contract Length     18.0     12.00     18.00     15
2      0              Age     25.0     25.00     32.25      5
3      2              Age     33.0     32.25     39.50     10
4      2  Contract Length     36.0     30.00     36.00     30
5      1              Age     54.0     46.75     54.00     20
6      0           Income  10203.0  10203.00  28858.00     10
7      1           Income  23822.0  10203.00  28858.00     10
8      2           Income  84823.0  66168.00  84823.00     25

# finally, group by "index" and sum the Score