Adding new customers in a dataframe after calculating their loyalty value-CodePudding

I have a dataset df which includes information on old customers and a list of customers that may be new or old. I would like to assign a loyalty value to new customers, i.e. those customers not evaluated yet in the dataset df. Please see below an example of what I would like to achieve. Let's say that I have a list of potential new customers (e.g., customers=['cust1', 'cust76']) whose information on their loyalty might be missing in the df dataset (called here existing_cust_df).

existing_cust_df=pd.read(path.csv)

The dataset has only two fields: one refers to customer's ID and the other one refers to the loyalty values (called just value).

Customer         value 
cust2             13
cust3             14
cust6             35
cust7             21
cust1             24

I would like to determine the loyalty of the new customer(s) (in this example will be only cust76, as cust1 is already in the dataset and we have information on its value - ie. 24) based on the value of existing customers included in the existing_cust_df dataset.

To determine the value of new customers, I use a distinct dataset, called extra_df, which includes the new customer and a list of potential customers that might be used to determine its loyalty value. The extra_df dataset looks like as follows:

   Customer         Related_customer
    cust76             cust32
    cust76             cust2
    cust76             cust1

The extra_df can then be used to look at related customers that might be included in the existing_cust_df in order to get their values from there and assign to the new customer by averaging those values. In the example I provided, cust76 is missing in the existing_cust_df so I look at the extra_df where, out of three customers related to it, there are only two (cust2 and cust1) that are in the existing_cust_df so their values. By averaging their values (13 and 24), I would like to assign this value (18.5) to the new customer (cust76) and append its ID (cust76) and its calculated value (18.5) to the existing_cust_df. So at the end I would have:

Customer         value 
cust2             13
cust3             14
cust6             35
cust7             21
cust1             24
cust76            18.5

I hope the steps above make sense; if they do not, please let me know. Can you please tell me how to run this check, looking at a customer in the list into the existing_cust_df, then in the extra_df if not included, calculating its average and finally update the value in the existing_cust_df? I know that to append a row in an existing dataframe I should do:

new_row = {'Customer': cust76, 'value': 18.5}
#append row to the dataframe
existing_cust_df = existing_cust_df.append(new_row, ignore_index=False)

But the difficulties in this case are to look at customers in the list that are not in the existing_cust_df and, then, at extra_df, in order to include them and their values once calculated the average (otherwise, it would be also good to assign a null value).

Any help would be great.

CodePudding user response：

Let's first build our dataframes :


csvfile = StringIO(
"""Customer\tvalue
cust2\t13
cust3\t14
cust6\t35
cust7\t21
cust1\t24""")
existing_cust_df = pd.read_csv(csvfile, sep = '\t', engine='python')

csvfile = StringIO("""
Customer\tRelated_customer
cust76\tcust32
cust76\tcust2
cust76\tcust1""")
extra_df = pd.read_csv(csvfile, sep = '\t', engine='python')

We want to loop for each input in extra_df, let's groupby :

dg = extra_df.groupby('Customer')

Then we can build a mask for each customer in extra_cust_df : True if a related_customer is in existing_cust_df.

dg.Related_customer.apply(lambda x : existing_cust_df.Customer.isin(x.to_numpy()))
Out[123]: 
Customer   
cust76    0     True
          1    False
          2    False
          3    False
          4     True
Name: Related_customer, dtype: bool

We can then apply this mask on existing_cust_df.value and then average it :


dg.Related_customer.apply(lambda x : existing_cust_df.query("Customer.isin(@x.to_numpy())"))

Out[125]: 
           Customer  value
Customer                  
cust76   0    cust2     13
         4    cust1     24


dg.Related_customer.apply(lambda x : existing_cust_df.query("Customer.isin(@x.to_numpy())").value.mean())

Out[126]: 
Customer
cust76    18.5
Name: Related_customer, dtype: float64

Finally we can append it to your existing df

CodePudding user response：

Here is a way to do what your question asks:

import pandas as pd
import numpy as np
# Objective:
#    for each customer in customers not already in existing_cust_df, 
#    add a record to existing_cust_df with value equal to 
#    the average of the values from existing_cust_df 
#    of Related_customer records in extra_df 
#    where extra_df.Customer == customer and Related_customer is in existing_cust_df.

existing_cust_df = pd.DataFrame({'Customer':['cust2','cust3','cust6','cust7','cust1'], 'value':[13,14,35,21,24]})
print("existing_cust_df:")
print(existing_cust_df)

extra_df = pd.DataFrame({'Customer':['cust76','cust76','cust76', 'cust77', 'cust77'], 'Related_customer':['cust32','cust2','cust1', 'cust2', 'cust6']})
print("extra_df:")
print(extra_df)

customers=['cust1', 'cust76', 'cust77']
new_cust_df = pd.DataFrame({'Customer': [c for c in customers if c not in set(existing_cust_df['Customer'])]})
print("new_cust_df:")
print(new_cust_df)

# create a record for each new_cust containing list of related_cust names 
#    with matching customer in extra_df where related_cust is a customer in existing_cust_df
existing_related_customer_values = \
    pd.merge( \
        pd.merge(extra_df, new_cust_df, on= 'Customer'), \
        existing_cust_df, left_on='Related_customer', right_on='Customer', suffixes=(None, '_y')) \
    [['Customer', 'Related_customer', 'value']]
print("existing_related_customer_values\n (JOIN from new_cust_df and extra_df on Customer [KEEP],\n then JOIN on extra_df.Related_customer and existing_cust_df.Customer [KEEP value])")
print(existing_related_customer_values)

# find the mean of values for existing Related_customer records for new customers
new_cust_df['value'] = new_cust_df.apply(lambda x: np.mean( \
    existing_related_customer_values[existing_related_customer_values['Customer'] == x['Customer']]['value']), axis=1)
print("new_cust_df (UPDATED with value column):")
print(new_cust_df)

existing_cust_df = pd.concat([existing_cust_df, new_cust_df], ignore_index=True)
print("existing_cust_df (UPDATED with new customer rows):")
print(existing_cust_df)

Output:

existing_cust_df:
  Customer  value
0    cust2     13
1    cust3     14
2    cust6     35
3    cust7     21
4    cust1     24
extra_df:
  Customer Related_customer
0   cust76           cust32
1   cust76            cust2
2   cust76            cust1
3   cust77            cust2
4   cust77            cust6
new_cust_df:
  Customer
0   cust76
1   cust77
existing_related_customer_values
 (JOIN from new_cust_df and extra_df on Customer [KEEP],
 then JOIN on extra_df.Related_customer and existing_cust_df.Customer [KEEP value])
  Customer Related_customer  value
0   cust76            cust2     13
1   cust77            cust2     13
2   cust76            cust1     24
3   cust77            cust6     35
new_cust_df (UPDATED with value column):
  Customer  value
0   cust76   18.5
1   cust77   24.0
existing_cust_df (UPDATED with new customer rows):
  Customer  value
0    cust2   13.0
1    cust3   14.0
2    cust6   35.0
3    cust7   21.0
4    cust1   24.0
5   cust76   18.5
6   cust77   24.0