so I've the following sample dataset:
Column A: Name
Column B: Email
Column C: Products
Column D: Transaction Date
I've two objectives:
To determine bulk customers (customers who purchase, let's say, 5 products or more in a single transaction), where each row represents a unique transaction with a unique timestamp.
To determine from the recurring customers (customers frequently making different transactions), who all are also bulk customers.
Now, I've already determined the list of recurring customers as follows:
n = 15
custmost1 = Order_Details['Name'].value_counts().index.tolist()[:n]
custmost2 = Order_Details['Name'].value_counts().values.tolist()[:n]
custmost = np.column_stack((custmost1,custmost2))
Where custmost denotes the series tuple clubbed together as an array for customers making frequent purchases with their counts. Order_Details is the dataframe I created for the dataset.
Now, I'm at my wit ends to figure out to maintain a count of different products being purchased in a single transaction (with a unique timestamp), and possibly, add it as a separate column in a dataframe.
I don't know if it's a feasible approach or not, but two ways were coming to my mind:
- One to count the number of commas, so that number of commas 1 will be number of products.
- To segregate each products into a separate line (which I already did, by the way, for maintaining a total count for a different insight), and checking with timestamp the number of products sold at a given timestamp.
I'd segregated the Products as follows:
reshaped = \
(Order_Details.set_index(Order_Details.columns.drop('Product',1).tolist())
.Product.str.split(',', expand=True)
.stack()
.reset_index()
.rename(columns={0:'Product'})
.loc[:, Order_Details.columns]
)
So, in this lieu, I would want someone to guide me as I feel like the aforementioned approaches are actually rather messy.
CodePudding user response:
Ok, so after a bit of extensive brainstorming, I've concocted the following way to do this:
In the original dataset's dataframe (Order_Details), I figured out to get the count of commas in each row of the Product column, which gave me a frequency of the number of products purchased in a single transaction. The code for that goes:
Order_Details['Number Of Products'] = Order_Details['Product'].str.count(",") 1
To make sure that I get the names of customers in a sorted order according to the frequency of purchases, I applied the following sort_values() function:
Dup_Order_Details = Order_Details
Dup_Order_Details.sort_values(["Number Of Products","Name"],axis=0, ascending=False,inplace=True,na_position='first')
Finally, a filter for those buying more than 'N' products (here, I took N=10, as I wanted this insight, y'all can take 'N' as input if you want):
Dup_Order_Details = Dup_Order_Details[Dup_Order_Details["Number Of Products"] >= 10]
Then a simple direct display can be done as per your need or you can convert it into a list or something, in case any visualization is needed (which I did).
CodePudding user response:
Assuming you already have a proper DataFrame
:
>>> df.applymap(lambda e: type(e).__name__).value_counts()
name email product date
str str list Timestamp 29
dtype: int64
(i.e., with columns: ['name', 'email', 'product', 'date']
, where the 'product'
column contains list
objects, and date
contains Timestamp
),
Then you could do this:
bulk_customers = set(df.loc[df['product'].apply(len) >= 5, 'name'])
s = df.groupby('name').size() > 1
recur_customers = set(s[s].index)
>>> bulk_customers
{'PERSON_108', 'PERSON_123'}
>>> recur_customers
{'PERSON_105'}
Notes
I changed the row of
PERSON_125
to bePERSON_105
, so that there would be one repeat customer. Likewise, I used a threshold ofn_visits > 1
as the criterion for "recurring", but you may want something different.You'd be well inspired to assign a unique ID to each of your customers. This could be based on email or perhaps you already have a customer ID. In any case, using
name
is prone to collisions, plus sometimes customers change name (e.g. through marriage) while keeping the same email or customer ID.You didn't mention over what period of time a customer needs to visit again in order to be considered "frequent". If that is to be considered, you have to be specific whether it is e.g. "within a calendar month", or "over the past 30 days", etc., as each leads to slightly different expressions.