Home > OS >  Analysis of Customer Data using Pandas Dataframes
Analysis of Customer Data using Pandas Dataframes

Time:06-22

I am currently working on an analysis of customers at a business. What I am trying to derive is how many customers purchased Product 1 before Product 2. I currently have data for all orders for the time period I am looking to analyze. I have it in a dataframe that is:

Customer ID Order ID Index of Order Products on Order (concatenated into a string)
Customer 1 Order 111 1 Product 1, Product 3, Product 4
Customer 1 Order 112 2 Product 2
Customer 2 Order 113 1 Product 2, Product 4
Customer 2 Order 114 2 Product 1

The index of order represents where the order falls within the lifespan of a customer, i.e. someone's first order, second order, etc. indexed as an integer starting at 1. I have an idea of how to set this up but I can't connect it to execution within Pandas/Python.

Essentially I want to use the logic:

  • How many customers have an order containing Product 1 with a lower order index than an order containing Product 2.

Each record is an order. In the above table example I would like for Customer 1 to be counted, but not 2 because customer 1 purchased Product 1 before Product 2.

I don't really have any code right now outside of cleaning the data to get to a workable data set that I can attempt to derive these insights from. I started spinning my wheels and wanted to reach out here for ideas or solutions.

CodePudding user response:

Here is one approach, first create another DataFrame with exploded products, then use it to slice based on your targets:

df2 = (df
 .assign(product=df['Products on Order'].str.split(', '))
 .explode('product')
)

target = ('Product 1', 'Product 2')
s0 = df2.loc[df2['product'].eq(target[0])].set_index('Customer ID')['Index of Order']
s1 = df2.loc[df2['product'].eq(target[1])].set_index('Customer ID')['Index of Order']
s1.sub(s0).gt(0)

output:

Customer ID
Customer 1     True
Customer 2    False
Name: Index of Order, dtype: bool

intermediate df2:

  Customer ID   Order ID  Index of Order                Products on Order    product
0  Customer 1  Order 111               1  Product 1, Product 3, Product 4  Product 1
0  Customer 1  Order 111               1  Product 1, Product 3, Product 4  Product 3
0  Customer 1  Order 111               1  Product 1, Product 3, Product 4  Product 4
1  Customer 1  Order 112               2                        Product 2  Product 2
2  Customer 2  Order 113               1             Product 2, Product 4  Product 2
2  Customer 2  Order 113               1             Product 2, Product 4  Product 4
3  Customer 2  Order 114               2                        Product 1  Product 1

CodePudding user response:

Just build it up piece by piece.

total_customers = df['Customer ID'].nunique()

desired_customers = 0
first_product = 'Product 1'
second_product = 'Product 2'

for customer in df['Customer ID'].unique():
    orders = df[df['Customer ID'] == customer]
    order_contents = orders['Products on Order (concatenated into a string)'].sum()

    if first_product in order_contents and second_product in order_contents:
        first_prod_ind = orders[orders['Products on Order (concatenated into a string)'].str.contains(first_product)]['Index of Order'].iloc[0]
        second_prod_ind = orders[orders['Products on Order (concatenated into a string)'].str.contains(second_product)]['Index of Order'].iloc[0]
        
        if first_prod_ind < second_prod_ind:
            desired_customers  = 1

print(f"Customers who bought {first_product} first: {desired_customers}")
print(f"Total customers: {total_customers}")

Note this only counts customers if the first time they bought Product 1 was before the first time they bought Product 2.

CodePudding user response:

I was able to accomplish the desired results with the following code

df['Prod_1'] = np.where(df['Products on Order (concatenated into a string)'].str.contains('Product 1'), df['Index of Order'], 0)
df['Prod_2'] = np.where(df['Products on Order (concatenated into a string)'].str.contains('Product 2'), df['Index of Order'], 0)
df['Prod_1'] = df.groupby('Customer ID')['Prod_1'].transform(max)
df['Prod_2'] = df.groupby('Customer ID')['Prod_2'].transform(max)
df = df.loc[np.where(df['Prod_1'].lt(df['Prod_2'], True, False))]
df['Customer ID'].unique().tolist()
  • Related