I have a Python pandas dataframe (df1) with 3 columns: UserId, TimeStamp, Context. This dataframe represents logs of user actions within the app. There are multiple lines for each user.
I have to merge this dataframe with an other dataframe (df2) on UserId. That dataframe (df2) contains info about users (i.e. their birthdate, account creation date, etc.). The goal is to perform machine learning. I will have to create features using time between actions in df1 and df2. But before doing that, I want everything into 1 dataframe to make it easier (hopefully?)
My initial idea is to group df1 on UserId and transform TimeStamp and Context into an array that would have objects : { timestamp, context } and then merge this new df into df2.
Is this a good idea? If not, how would you go about it?
I was originally just going to ask help on grouping and aggregating my data in df1 into a new dataframe. Here is what I have so far, the result and the desired result.
Thank you
Raw data
df1 = df[['UserId', 'Timestamp', 'Context']]
----------------------------------------------------
UserId TimeStamp Context
1 618884 2015-12-18 11:29:33 410-513-FD-10-01
2 618884 2015-12-18 11:29:38 540-823-RE-12-01
3 620141 2015-12-18 12:29:02 101-901-RE-10-03
4 620141 2015-12-18 12:29:34 Other
----------------------------------------------------
Current code
gp = df1.groupby(['UserId']).agg({'UserId': 'first',
'TimeStamp': ', '.join,})
gp.head()
----------------------------------------------------
Current output
UserId UserId Timestamp
585953 585953 2016-04-01 11:08:06, 2016-04-01 11:08:10
586182 586182 2016-05-01 15:53:01
586400 586400 2015-12-27 12:31:11, 2015-12-27 12:31:44, 2015-12-29 14:35:38, 2015-12-29 14:35:45, 2016-01-10 01:42:28
586573 586573 2016-01-12 14:40:41, 2016-04-19 19:56:41, 2016-04-19 19:56:55
586716 586716 2016-01-13 13:50:05
----------------------------------------------------
Desired result:
UserId Actions
1 [ { 2016-04-01 19:40:32, 410-513-FD-10-01 }
{ 2017-05-03 20:34:21, 320-391-RE-12-03 } ]
2 [ { 2019-03-12 12:08:12, Other } ]
CodePudding user response:
You can't use groupby
and agg
or transform
because the transformation applied for each column. You can't access to Context
column while you are processing TimeStamp
column. A classical way is to use groupby_apply
:
out = df.groupby('UserId').apply(lambda x: list(zip(x['TimeStamp'], x['Context']))) \
.rename('Actions').reset_index()
print(out)
# Output
UserId Actions
0 618884 [(2015-12-18 11:29:33, 410-513-FD-10-01), (201...
1 620141 [(2015-12-18 12:29:02, 101-901-RE-10-03), (201...
Note: if I was you, I don't do this operation, I will merge the 2 dataframes directly. Your column Actions
is now too complicated to process for ML.