I am currently working on an analysis of customers at a business. What I am trying to derive is how many customers purchased Product 1 before Product 2. I currently have data for all orders for the time period I am looking to analyze. I have it in a dataframe that is:
Customer ID | Order ID | Index of Order | Products on Order (concatenated into a string) |
---|---|---|---|
Customer 1 | Order 111 | 1 | Product 1, Product 3, Product 4 |
Customer 1 | Order 112 | 2 | Product 2 |
Customer 2 | Order 113 | 1 | Product 2, Product 4 |
Customer 2 | Order 114 | 2 | Product 1 |
The index of order
represents where the order falls within the lifespan of a customer, i.e. someone's first order, second order, etc. indexed as an integer starting at 1. I have an idea of how to set this up but I can't connect it to execution within Pandas/Python.
Essentially I want to use the logic:
- How many customers have an order containing Product 1 with a lower order index than an order containing Product 2.
Each record is an order. In the above table example I would like for Customer 1 to be counted, but not 2 because customer 1 purchased Product 1 before Product 2.
I don't really have any code right now outside of cleaning the data to get to a workable data set that I can attempt to derive these insights from. I started spinning my wheels and wanted to reach out here for ideas or solutions.
CodePudding user response:
Here is one approach, first create another DataFrame with exploded products, then use it to slice based on your targets:
df2 = (df
.assign(product=df['Products on Order'].str.split(', '))
.explode('product')
)
target = ('Product 1', 'Product 2')
s0 = df2.loc[df2['product'].eq(target[0])].set_index('Customer ID')['Index of Order']
s1 = df2.loc[df2['product'].eq(target[1])].set_index('Customer ID')['Index of Order']
s1.sub(s0).gt(0)
output:
Customer ID
Customer 1 True
Customer 2 False
Name: Index of Order, dtype: bool
intermediate df2
:
Customer ID Order ID Index of Order Products on Order product
0 Customer 1 Order 111 1 Product 1, Product 3, Product 4 Product 1
0 Customer 1 Order 111 1 Product 1, Product 3, Product 4 Product 3
0 Customer 1 Order 111 1 Product 1, Product 3, Product 4 Product 4
1 Customer 1 Order 112 2 Product 2 Product 2
2 Customer 2 Order 113 1 Product 2, Product 4 Product 2
2 Customer 2 Order 113 1 Product 2, Product 4 Product 4
3 Customer 2 Order 114 2 Product 1 Product 1
CodePudding user response:
Just build it up piece by piece.
total_customers = df['Customer ID'].nunique()
desired_customers = 0
first_product = 'Product 1'
second_product = 'Product 2'
for customer in df['Customer ID'].unique():
orders = df[df['Customer ID'] == customer]
order_contents = orders['Products on Order (concatenated into a string)'].sum()
if first_product in order_contents and second_product in order_contents:
first_prod_ind = orders[orders['Products on Order (concatenated into a string)'].str.contains(first_product)]['Index of Order'].iloc[0]
second_prod_ind = orders[orders['Products on Order (concatenated into a string)'].str.contains(second_product)]['Index of Order'].iloc[0]
if first_prod_ind < second_prod_ind:
desired_customers = 1
print(f"Customers who bought {first_product} first: {desired_customers}")
print(f"Total customers: {total_customers}")
Note this only counts customers if the first time they bought Product 1 was before the first time they bought Product 2.
CodePudding user response:
I was able to accomplish the desired results with the following code
df['Prod_1'] = np.where(df['Products on Order (concatenated into a string)'].str.contains('Product 1'), df['Index of Order'], 0)
df['Prod_2'] = np.where(df['Products on Order (concatenated into a string)'].str.contains('Product 2'), df['Index of Order'], 0)
df['Prod_1'] = df.groupby('Customer ID')['Prod_1'].transform(max)
df['Prod_2'] = df.groupby('Customer ID')['Prod_2'].transform(max)
df = df.loc[np.where(df['Prod_1'].lt(df['Prod_2'], True, False))]
df['Customer ID'].unique().tolist()