I have a dataset df which includes information on old customers and a list of customers that may be new or old. I would like to assign a loyalty value to new customers, i.e. those customers not evaluated yet in the dataset df.
Please see below an example of what I would like to achieve.
Let's say that I have a list of potential new customers (e.g., customers=['cust1', 'cust76']
) whose information on their loyalty might be missing in the df dataset (called here existing_cust_df
).
existing_cust_df=pd.read(path.csv)
The dataset has only two fields: one refers to customer's ID and the other one refers to the loyalty values (called just value
).
Customer value
cust2 13
cust3 14
cust6 35
cust7 21
cust1 24
I would like to determine the loyalty of the new customer(s) (in this example will be only cust76
, as cust1
is already in the dataset and we have information on its value - ie. 24) based on the value of existing customers included in the existing_cust_df
dataset.
To determine the value of new customers, I use a distinct dataset, called extra_df
, which includes the new customer and a list of potential customers that might be used to determine its loyalty value. The extra_df
dataset looks like as follows:
Customer Related_customer
cust76 cust32
cust76 cust2
cust76 cust1
The extra_df can then be used to look at related customers that might be included in the existing_cust_df
in order to get their values from there and assign to the new customer by averaging those values.
In the example I provided, cust76
is missing in the existing_cust_df
so I look at the extra_df
where, out of three customers related to it, there are only two (cust2
and cust1
) that are in the existing_cust_df
so their values.
By averaging their values (13
and 24
), I would like to assign this value (18.5
) to the new customer (cust76
) and append its ID (cust76
) and its calculated value (18.5
) to the existing_cust_df
.
So at the end I would have:
Customer value
cust2 13
cust3 14
cust6 35
cust7 21
cust1 24
cust76 18.5
I hope the steps above make sense; if they do not, please let me know.
Can you please tell me how to run this check, looking at a customer in the list into the existing_cust_df
, then in the extra_df
if not included, calculating its average and finally update the value in the existing_cust_df
?
I know that to append a row in an existing dataframe I should do:
new_row = {'Customer': cust76, 'value': 18.5}
#append row to the dataframe
existing_cust_df = existing_cust_df.append(new_row, ignore_index=False)
But the difficulties in this case are to look at customers in the list that are not in the existing_cust_df
and, then, at extra_df
, in order to include them and their values once calculated the average (otherwise, it would be also good to assign a null value).
Any help would be great.
CodePudding user response:
Let's first build our dataframes :
csvfile = StringIO(
"""Customer\tvalue
cust2\t13
cust3\t14
cust6\t35
cust7\t21
cust1\t24""")
existing_cust_df = pd.read_csv(csvfile, sep = '\t', engine='python')
csvfile = StringIO("""
Customer\tRelated_customer
cust76\tcust32
cust76\tcust2
cust76\tcust1""")
extra_df = pd.read_csv(csvfile, sep = '\t', engine='python')
We want to loop for each input in extra_df
, let's groupby :
dg = extra_df.groupby('Customer')
Then we can build a mask for each customer in extra_cust_df : True
if a related_customer is in existing_cust_df.
dg.Related_customer.apply(lambda x : existing_cust_df.Customer.isin(x.to_numpy()))
Out[123]:
Customer
cust76 0 True
1 False
2 False
3 False
4 True
Name: Related_customer, dtype: bool
We can then apply this mask on existing_cust_df.value and then average it :
dg.Related_customer.apply(lambda x : existing_cust_df.query("Customer.isin(@x.to_numpy())"))
Out[125]:
Customer value
Customer
cust76 0 cust2 13
4 cust1 24
dg.Related_customer.apply(lambda x : existing_cust_df.query("Customer.isin(@x.to_numpy())").value.mean())
Out[126]:
Customer
cust76 18.5
Name: Related_customer, dtype: float64
Finally we can append it to your existing df
CodePudding user response:
Here is a way to do what your question asks:
import pandas as pd
import numpy as np
# Objective:
# for each customer in customers not already in existing_cust_df,
# add a record to existing_cust_df with value equal to
# the average of the values from existing_cust_df
# of Related_customer records in extra_df
# where extra_df.Customer == customer and Related_customer is in existing_cust_df.
existing_cust_df = pd.DataFrame({'Customer':['cust2','cust3','cust6','cust7','cust1'], 'value':[13,14,35,21,24]})
print("existing_cust_df:")
print(existing_cust_df)
extra_df = pd.DataFrame({'Customer':['cust76','cust76','cust76', 'cust77', 'cust77'], 'Related_customer':['cust32','cust2','cust1', 'cust2', 'cust6']})
print("extra_df:")
print(extra_df)
customers=['cust1', 'cust76', 'cust77']
new_cust_df = pd.DataFrame({'Customer': [c for c in customers if c not in set(existing_cust_df['Customer'])]})
print("new_cust_df:")
print(new_cust_df)
# create a record for each new_cust containing list of related_cust names
# with matching customer in extra_df where related_cust is a customer in existing_cust_df
existing_related_customer_values = \
pd.merge( \
pd.merge(extra_df, new_cust_df, on= 'Customer'), \
existing_cust_df, left_on='Related_customer', right_on='Customer', suffixes=(None, '_y')) \
[['Customer', 'Related_customer', 'value']]
print("existing_related_customer_values\n (JOIN from new_cust_df and extra_df on Customer [KEEP],\n then JOIN on extra_df.Related_customer and existing_cust_df.Customer [KEEP value])")
print(existing_related_customer_values)
# find the mean of values for existing Related_customer records for new customers
new_cust_df['value'] = new_cust_df.apply(lambda x: np.mean( \
existing_related_customer_values[existing_related_customer_values['Customer'] == x['Customer']]['value']), axis=1)
print("new_cust_df (UPDATED with value column):")
print(new_cust_df)
existing_cust_df = pd.concat([existing_cust_df, new_cust_df], ignore_index=True)
print("existing_cust_df (UPDATED with new customer rows):")
print(existing_cust_df)
Output:
existing_cust_df:
Customer value
0 cust2 13
1 cust3 14
2 cust6 35
3 cust7 21
4 cust1 24
extra_df:
Customer Related_customer
0 cust76 cust32
1 cust76 cust2
2 cust76 cust1
3 cust77 cust2
4 cust77 cust6
new_cust_df:
Customer
0 cust76
1 cust77
existing_related_customer_values
(JOIN from new_cust_df and extra_df on Customer [KEEP],
then JOIN on extra_df.Related_customer and existing_cust_df.Customer [KEEP value])
Customer Related_customer value
0 cust76 cust2 13
1 cust77 cust2 13
2 cust76 cust1 24
3 cust77 cust6 35
new_cust_df (UPDATED with value column):
Customer value
0 cust76 18.5
1 cust77 24.0
existing_cust_df (UPDATED with new customer rows):
Customer value
0 cust2 13.0
1 cust3 14.0
2 cust6 35.0
3 cust7 21.0
4 cust1 24.0
5 cust76 18.5
6 cust77 24.0