Analysis of Customer Data using Pandas Dataframes-CodePudding

I am currently working on an analysis of customers at a business. What I am trying to derive is how many customers purchased Product 1 before Product 2. I currently have data for all orders for the time period I am looking to analyze. I have it in a dataframe that is:

Customer ID	Order ID	Index of Order	Products on Order (concatenated into a string)
Customer 1	Order 111	1	Product 1, Product 3, Product 4
Customer 1	Order 112	2	Product 2
Customer 2	Order 113	1	Product 2, Product 4
Customer 2	Order 114	2	Product 1

The index of order represents where the order falls within the lifespan of a customer, i.e. someone's first order, second order, etc. indexed as an integer starting at 1. I have an idea of how to set this up but I can't connect it to execution within Pandas/Python.

Essentially I want to use the logic:

How many customers have an order containing Product 1 with a lower order index than an order containing Product 2.

Each record is an order. In the above table example I would like for Customer 1 to be counted, but not 2 because customer 1 purchased Product 1 before Product 2.

I don't really have any code right now outside of cleaning the data to get to a workable data set that I can attempt to derive these insights from. I started spinning my wheels and wanted to reach out here for ideas or solutions.

CodePudding user response：

Here is one approach, first create another DataFrame with exploded products, then use it to slice based on your targets:

df2 = (df
 .assign(product=df['Products on Order'].str.split(', '))
 .explode('product')
)

target = ('Product 1', 'Product 2')
s0 = df2.loc[df2['product'].eq(target[0])].set_index('Customer ID')['Index of Order']
s1 = df2.loc[df2['product'].eq(target[1])].set_index('Customer ID')['Index of Order']
s1.sub(s0).gt(0)

output:

Customer ID
Customer 1     True
Customer 2    False
Name: Index of Order, dtype: bool

intermediate df2:

  Customer ID   Order ID  Index of Order                Products on Order    product
0  Customer 1  Order 111               1  Product 1, Product 3, Product 4  Product 1
0  Customer 1  Order 111               1  Product 1, Product 3, Product 4  Product 3
0  Customer 1  Order 111               1  Product 1, Product 3, Product 4  Product 4
1  Customer 1  Order 112               2                        Product 2  Product 2
2  Customer 2  Order 113               1             Product 2, Product 4  Product 2
2  Customer 2  Order 113               1             Product 2, Product 4  Product 4
3  Customer 2  Order 114               2                        Product 1  Product 1

CodePudding user response：

Just build it up piece by piece.

total_customers = df['Customer ID'].nunique()

desired_customers = 0
first_product = 'Product 1'
second_product = 'Product 2'

for customer in df['Customer ID'].unique():
    orders = df[df['Customer ID'] == customer]
    order_contents = orders['Products on Order (concatenated into a string)'].sum()

    if first_product in order_contents and second_product in order_contents:
        first_prod_ind = orders[orders['Products on Order (concatenated into a string)'].str.contains(first_product)]['Index of Order'].iloc[0]
        second_prod_ind = orders[orders['Products on Order (concatenated into a string)'].str.contains(second_product)]['Index of Order'].iloc[0]
        
        if first_prod_ind < second_prod_ind:
            desired_customers  = 1

print(f"Customers who bought {first_product} first: {desired_customers}")
print(f"Total customers: {total_customers}")

Note this only counts customers if the first time they bought Product 1 was before the first time they bought Product 2.

CodePudding user response：

I was able to accomplish the desired results with the following code

df['Prod_1'] = np.where(df['Products on Order (concatenated into a string)'].str.contains('Product 1'), df['Index of Order'], 0)
df['Prod_2'] = np.where(df['Products on Order (concatenated into a string)'].str.contains('Product 2'), df['Index of Order'], 0)
df['Prod_1'] = df.groupby('Customer ID')['Prod_1'].transform(max)
df['Prod_2'] = df.groupby('Customer ID')['Prod_2'].transform(max)
df = df.loc[np.where(df['Prod_1'].lt(df['Prod_2'], True, False))]
df['Customer ID'].unique().tolist()