Python Dataframe : Groupby and Transform-CodePudding

I have a Python pandas dataframe (df1) with 3 columns: UserId, TimeStamp, Context. This dataframe represents logs of user actions within the app. There are multiple lines for each user.

I have to merge this dataframe with an other dataframe (df2) on UserId. That dataframe (df2) contains info about users (i.e. their birthdate, account creation date, etc.). The goal is to perform machine learning. I will have to create features using time between actions in df1 and df2. But before doing that, I want everything into 1 dataframe to make it easier (hopefully?)

My initial idea is to group df1 on UserId and transform TimeStamp and Context into an array that would have objects : { timestamp, context } and then merge this new df into df2.

Is this a good idea? If not, how would you go about it?

I was originally just going to ask help on grouping and aggregating my data in df1 into a new dataframe. Here is what I have so far, the result and the desired result.

Thank you

     Raw data
     df1 = df[['UserId', 'Timestamp', 'Context']]
    ----------------------------------------------------
        UserId    TimeStamp             Context
    1   618884    2015-12-18 11:29:33   410-513-FD-10-01
    2   618884    2015-12-18 11:29:38   540-823-RE-12-01
    3   620141    2015-12-18 12:29:02   101-901-RE-10-03
    4   620141    2015-12-18 12:29:34   Other
    ----------------------------------------------------
    Current code    
    gp = df1.groupby(['UserId']).agg({'UserId': 'first',
                                     'TimeStamp': ', '.join,})
    gp.head()
    ----------------------------------------------------
    Current output      
    UserId  UserId  Timestamp
    585953  585953  2016-04-01 11:08:06, 2016-04-01 11:08:10
    586182  586182  2016-05-01 15:53:01
    586400  586400  2015-12-27 12:31:11, 2015-12-27 12:31:44, 2015-12-29 14:35:38, 2015-12-29 14:35:45, 2016-01-10 01:42:28
    586573  586573  2016-01-12 14:40:41, 2016-04-19 19:56:41, 2016-04-19 19:56:55
    586716  586716  2016-01-13 13:50:05
    ----------------------------------------------------
    Desired result:
    UserId                       Actions
    1                            [ { 2016-04-01 19:40:32, 410-513-FD-10-01 }
                                   { 2017-05-03 20:34:21, 320-391-RE-12-03 } ]
    2                            [ { 2019-03-12 12:08:12, Other } ]

CodePudding user response：

You can't use groupby and agg or transform because the transformation applied for each column. You can't access to Context column while you are processing TimeStamp column. A classical way is to use groupby_apply:

out = df.groupby('UserId').apply(lambda x: list(zip(x['TimeStamp'], x['Context']))) \
        .rename('Actions').reset_index()
print(out)

# Output
   UserId                                            Actions
0  618884  [(2015-12-18 11:29:33, 410-513-FD-10-01), (201...
1  620141  [(2015-12-18 12:29:02, 101-901-RE-10-03), (201...

Note: if I was you, I don't do this operation, I will merge the 2 dataframes directly. Your column Actions is now too complicated to process for ML.