I have dataset with millions of rows, here is an example of what it looks like and what I intend to output:
data = [[1, 100, 8], [1, 100, 4],
[1, 100,6], [2, 100, 0],
[2, 200, 1], [3, 300, 7],
[4, 400, 2], [5, 100, 6],
[5, 100, 3], [5, 600, 1]]
df= pd.DataFrame(data, columns =['user', 'time', 'item'])
print(df)
user time item
1 100 8
1 100 4
1 100 6
2 100 0
2 200 1
3 300 7
4 400 2
5 100 6
5 100 3
5 600 1
The desired output should have all items consumed by a user within the same time to appear together in the items
column as follows
user time item
1 100 8,4,6
2 100 0
5 100 6,3
2 200 1
3 300 7
4 400 2
5 500 6
For example, user: 1
consumed products 8,4,6
within time: 100
How could this be achieved?
CodePudding user response:
Use df.astype
with Groupby.agg
and df.sort_values
:
In [489]: out = df.astype(str).groupby(['user', 'time'])['item'].agg(','.join).reset_index().sort_values('time')
In [490]: out
Out[490]:
user time item
0 1 100 8,4,6
1 2 100 0
5 5 100 6,3
2 2 200 1
3 3 300 7
4 4 400 2
6 5 600 1