How could I create this dictionary who's values are coming from a DataFrame faster?-CodePudding

I have this dataframe:

event	attendees	duration
meeting	[{"id":1, "email": "email1"}, {"id":2, "email": "email2"}]	3600
lunch	[{"id":2, "email": "email2"}, {"id":3, "email": "email3"}]	7200

Which I am trying to turn into this dictionary:

{
    email1: {
       'num_events_with': 1,
       'duration_of_events': 3600,
    },
    email2: {
       'num_events_with': 2,
       'duration_of_events': 10,800,
    },
    email3: {
       'num_events_with': 1,
       'duration_of_events': 7200,
    },
}

except in my case the dataframe has thousands of rows and the dictionary I'm creating uses multiple columns to get the results for the dictionary keys so I need to be able to access the information that is relevant to each user email while creating the dictionary.

The purpose of the dictionary is to give information about the people the user has been in events with. So the first dictionary key is saying that the user has been in 1 event with email1 which lasted 3600 seconds.

Here is my approach to getting this dictionary:

# need to sort because I use diff() later
df.sort_values(by='startTime', ascending=True, inplace=True)

# a list of all contacts (emails) that have been in events with user
contacts = contacts_in_events
contact_info_dict = {}
df['attendees_str'] = df['attendees'].astype(str)

for contact in contacts:
    temp_df = df[df['attendees_str'].str.contains(contact)]
    duration_of_events = temp_df['duration'].sum()
    num_events_with = len(temp_df.index)

    contact_info_dict[contact] = {
        'duration_of_events': duration_of_events,
        'num_events_with': num_events_with
    }

but this is too slow. Any ideas for how to do this in a different way that would be faster?

This is the output of the actual dataframe .to_dict('records'):

{
 'creator': {
            'displayName': None,
            'email': 'creator of event',
            'id': None,
            'self': None
  },
  'start': {
            'date': None,
            'dateTime': '2022-09-13T12:30:00-04:00',
            'timeZone': 'America/Toronto'
  },
  'end': {
          'date': None,
          'dateTime': '2022-09-13T13:00:00-04:00',
          'timeZone': 'America/Toronto'
   },
  'attendees': [
       {
        'comment': None,
        'displayName': None,
        'email1': '[email protected]',
        'responseStatus': 'accepted'
       },
       {
        'comment': None,
        'displayName': None,
        'email': '[email protected]',
        'responseStatus': 'accepted'
       }
  ],
  'summary': 'One on One Meeting',
  'description': '...',
  'calendarType': 'work',
  'startTime': Timestamp('2022-09-13 16:30:00 0000', tz='UTC'),
  'endTime': Timestamp('2022-09-13 17:00:00 0000', tz='UTC'),
  'eventDuration': 1800.0,
  'dowStart': 1.0,
  'endStart': 1.0,
  'weekday': True,
  'startTOD': 59400,
  'endTOD': 61200,
  'day': Period('2022-09-13', 'D')
}

CodePudding user response：

explode 'attendees' to individual rows, then convert to columns with json_normalize, aggregate the data with groupby.agg and convert to_dict:

out = (df
   .explode('attendees', ignore_index=True)
   .pipe(lambda d: d.join(pd.json_normalize(d.pop('attendees'))))
   .groupby('email')
   .agg(**{'num_events_with': ('email', 'count'),
           'duration_of_events': ('duration', 'sum')
          })
   .to_dict(orient='index')
)

Output:

{'email1': {'num_events_with': 1, 'duration_of_events': 3600},
 'email2': {'num_events_with': 2, 'duration_of_events': 10800},
 'email3': {'num_events_with': 1, 'duration_of_events': 7200}}

CodePudding user response：

Example

col = ['event', 'attendees', 'duration']
data = [['meeting', [{"id":1, "email": "email1"}, {"id":2, "email": "email2"}], 3600], ['lunch', [{"id":2, "email": "email2"}, {"id":3, "email": "email3"}],7200]]
df = pd.DataFrame(data, columns=col)

Code

df1 = df.explode('attendees')
grouper = df1['attendees'].str['email']
col1 = ['num_events_with', 'duration_of_events']
out = (df1.groupby(grouper)['duration'].agg(['count', sum]).T.set_axis(col1).to_dict())

out:

{'email1': {'num_events_with': 1, 'duration_of_events': 3600},
 'email2': {'num_events_with': 2, 'duration_of_events': 10800},
 'email3': {'num_events_with': 1, 'duration_of_events': 7200}}

if you want 1 line use following

(df.explode('attendees').assign(attendees=lambda x:x['attendees'].str['email'])
 .groupby('attendees')['duration'].agg(['count',sum])
 .T.set_axis(['num_events_with', 'duration_of_events']).to_dict())