I am working with sequential and frequent pattern mining. I was given this type of dataset to do the task, and I am told to make a sequence from the dataset before processing.
This is the sample data taken from dataset, in table format. The table in .csv format is available at: https://drive.google.com/file/d/1j1rEy4Q600y_oym23cG3m3NNWuNvIcgG/view?usp=sharing
User | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 |
---|---|---|---|---|---|---|
A | milk | cake | citrus | |||
B | cheese | milk | bread | cabbage | carrot | |
A | tea | juice | citrus | salmon | ||
B | apple | orange | ||||
B | cake |
At first, I think I have to make the csv file into Pandas Dataframe. I have no problem with that, what I want to ask is, how is it possible with dataframe to produce result like this?
Expected result 1, a group of items bought from 1 user is grouped into one tuple
User | Transactions |
---|---|
A | (milk cake citrus)(tea juice citrus salmon) |
B | (cheese milk bread cabbage carrot)(apple orange)(cake) |
Expected result 2, each item purchased by user is not grouped by one.
User | Transactions |
---|---|
A | milk, cake, citrus, tea, juice, citrus, salmon, |
B | cheese, milk, bread, cabbage, carrot, apple, orange, cake |
My question is, how to make those dataframe? I've tried a solution from this article: How to group dataframe rows into list in pandas groupby, but it is still not successful.
CodePudding user response:
In order to get the 1st result
out = df.set_index('User').apply(lambda x : tuple(x[x!=''].tolist()),axis=1).groupby(level=0).agg(list).reset_index(name='Transactions')
Out[95]:
User Transactions
0 A [(milk, cake, citrus), (tea, juice, citrus, sa...
1 B [(cheese, milk, bread, cabbage, carrot), (appl...
For the 2nd result which is more easy than the pervious one
df.set_index('User').replace('',np.nan).stack().groupby(level=0).agg(','.join)
Out[97]:
User
A milk,cake,citrus,tea,juice,citrus,salmon
B cheese,milk,bread,cabbage,carrot,apple,orange,...
dtype: object
CodePudding user response:
Let's start with the second one:
(df.set_index('User')
.stack()
.groupby(level=0).apply(list)
.rename('Transactions')
.reset_index()
)
output:
User Transactions
0 A [milk, cake, citrus, tea, juice, citrus, salmon]
1 B [cheese, milk, bread, cabbage, carrot, apple, ...
To get the first one, on just need to add a new column:
(df.assign(group=df.groupby('User').cumcount())
.set_index(['User', 'group'])
.stack()
.groupby(level=[0,1]).apply(tuple)
.groupby(level=0).apply(list)
.rename('Transactions')
.reset_index()
)
output:
User Transactions
0 A [(milk, cake, citrus), (tea, juice, citrus, sa...
1 B [(cheese, milk, bread, cabbage, carrot), (appl...
CodePudding user response:
import pandas as pd
df = pd.read_csv('sampletable.csv')
df['Transactions'] = '(' df[['Item 1','Item 2','Item 3','Item 4','Item 5','Item 6']].apply(lambda x: x.str.cat(sep=' '), axis=1) ')'
df = df.groupby(['User'])['Transactions'].apply(lambda x: ''.join(x)).reset_index()
print(df)
output:
User Transactions
0 A (milk cake citrus)(tea juice citrus salmon)
1 B (cheese milk bread cabbage carrot)(apple orange)(cake)
for the second output, use this:
df = pd.read_csv('sampletable.csv')
df['a'] = df[['Item 1','Item 2','Item 3','Item 4','Item 5','Item 6']].apply(lambda x: x.str.cat(sep=', '), axis=1)
df = df.groupby(['User'])['a'].apply(lambda x: ', '.join(x)).reset_index()
print(df)