How to combine data from one row in dataframe?-CodePudding

I am working with sequential and frequent pattern mining. I was given this type of dataset to do the task, and I am told to make a sequence from the dataset before processing.

This is the sample data taken from dataset, in table format. The table in .csv format is available at: https://drive.google.com/file/d/1j1rEy4Q600y_oym23cG3m3NNWuNvIcgG/view?usp=sharing

User	Item 1	Item 2	Item 3	Item 4	Item 5
A	milk	cake	citrus
B	cheese	milk	bread	cabbage	carrot
A	tea	juice	citrus	salmon
B	apple	orange
B	cake

At first, I think I have to make the csv file into Pandas Dataframe. I have no problem with that, what I want to ask is, how is it possible with dataframe to produce result like this?

Expected result 1, a group of items bought from 1 user is grouped into one tuple

User	Transactions
A	(milk cake citrus)(tea juice citrus salmon)
B	(cheese milk bread cabbage carrot)(apple orange)(cake)

Expected result 2, each item purchased by user is not grouped by one.

User	Transactions
A	milk, cake, citrus, tea, juice, citrus, salmon,
B	cheese, milk, bread, cabbage, carrot, apple, orange, cake

My question is, how to make those dataframe? I've tried a solution from this article: How to group dataframe rows into list in pandas groupby, but it is still not successful.

CodePudding user response：

In order to get the 1st result

out = df.set_index('User').apply(lambda x : tuple(x[x!=''].tolist()),axis=1).groupby(level=0).agg(list).reset_index(name='Transactions')
Out[95]: 
  User                                       Transactions
0    A  [(milk, cake, citrus), (tea, juice, citrus, sa...
1    B  [(cheese, milk, bread, cabbage, carrot), (appl...

For the 2nd result which is more easy than the pervious one

df.set_index('User').replace('',np.nan).stack().groupby(level=0).agg(','.join)
Out[97]: 
User
A             milk,cake,citrus,tea,juice,citrus,salmon
B    cheese,milk,bread,cabbage,carrot,apple,orange,...
dtype: object

CodePudding user response：

Let's start with the second one:

(df.set_index('User')
   .stack()
   .groupby(level=0).apply(list)
   .rename('Transactions')
   .reset_index()
)

output:

  User                                       Transactions
0    A   [milk, cake, citrus, tea, juice, citrus, salmon]
1    B  [cheese, milk, bread, cabbage, carrot, apple, ...

To get the first one, on just need to add a new column:

(df.assign(group=df.groupby('User').cumcount())
   .set_index(['User', 'group'])
   .stack()
   .groupby(level=[0,1]).apply(tuple)
   .groupby(level=0).apply(list)
   .rename('Transactions')
   .reset_index()
)

output:

  User                                       Transactions
0    A  [(milk, cake, citrus), (tea, juice, citrus, sa...
1    B  [(cheese, milk, bread, cabbage, carrot), (appl...

CodePudding user response：

import pandas as pd
df = pd.read_csv('sampletable.csv')

df['Transactions'] = '('   df[['Item 1','Item 2','Item 3','Item 4','Item 5','Item 6']].apply(lambda x: x.str.cat(sep=' '), axis=1)   ')'

df = df.groupby(['User'])['Transactions'].apply(lambda x: ''.join(x)).reset_index()

print(df)

output:

  User                                                  Transactions
0    A        (milk cake citrus)(tea juice citrus salmon)
1    B  (cheese milk bread cabbage carrot)(apple orange)(cake)

for the second output, use this:

df = pd.read_csv('sampletable.csv')

df['a'] = df[['Item 1','Item 2','Item 3','Item 4','Item 5','Item 6']].apply(lambda x: x.str.cat(sep=', '), axis=1)

df = df.groupby(['User'])['a'].apply(lambda x: ', '.join(x)).reset_index()

print(df)