I have this dataframe:
event | attendees | duration |
---|---|---|
meeting | [{"id":1, "email": "email1"}, {"id":2, "email": "email2"}] | 3600 |
lunch | [{"id":2, "email": "email2"}, {"id":3, "email": "email3"}] | 7200 |
Which I am trying to turn into this dictionary:
{
email1: {
'num_events_with': 1,
'duration_of_events': 3600,
},
email2: {
'num_events_with': 2,
'duration_of_events': 10,800,
},
email3: {
'num_events_with': 1,
'duration_of_events': 7200,
},
}
except in my case the dataframe has thousands of rows and the dictionary I'm creating uses multiple columns to get the results for the dictionary keys so I need to be able to access the information that is relevant to each user email while creating the dictionary.
The purpose of the dictionary is to give information about the people the user has been in events with. So the first dictionary key is saying that the user has been in 1 event with email1 which lasted 3600 seconds.
Here is my approach to getting this dictionary:
# need to sort because I use diff() later
df.sort_values(by='startTime', ascending=True, inplace=True)
# a list of all contacts (emails) that have been in events with user
contacts = contacts_in_events
contact_info_dict = {}
df['attendees_str'] = df['attendees'].astype(str)
for contact in contacts:
temp_df = df[df['attendees_str'].str.contains(contact)]
duration_of_events = temp_df['duration'].sum()
num_events_with = len(temp_df.index)
contact_info_dict[contact] = {
'duration_of_events': duration_of_events,
'num_events_with': num_events_with
}
but this is too slow. Any ideas for how to do this in a different way that would be faster?
This is the output of the actual dataframe .to_dict('records'):
{
'creator': {
'displayName': None,
'email': 'creator of event',
'id': None,
'self': None
},
'start': {
'date': None,
'dateTime': '2022-09-13T12:30:00-04:00',
'timeZone': 'America/Toronto'
},
'end': {
'date': None,
'dateTime': '2022-09-13T13:00:00-04:00',
'timeZone': 'America/Toronto'
},
'attendees': [
{
'comment': None,
'displayName': None,
'email1': '[email protected]',
'responseStatus': 'accepted'
},
{
'comment': None,
'displayName': None,
'email': '[email protected]',
'responseStatus': 'accepted'
}
],
'summary': 'One on One Meeting',
'description': '...',
'calendarType': 'work',
'startTime': Timestamp('2022-09-13 16:30:00 0000', tz='UTC'),
'endTime': Timestamp('2022-09-13 17:00:00 0000', tz='UTC'),
'eventDuration': 1800.0,
'dowStart': 1.0,
'endStart': 1.0,
'weekday': True,
'startTOD': 59400,
'endTOD': 61200,
'day': Period('2022-09-13', 'D')
}
CodePudding user response:
explode
'attendees' to individual rows, then convert to columns with json_normalize
, aggregate the data with groupby.agg
and convert to_dict
:
out = (df
.explode('attendees', ignore_index=True)
.pipe(lambda d: d.join(pd.json_normalize(d.pop('attendees'))))
.groupby('email')
.agg(**{'num_events_with': ('email', 'count'),
'duration_of_events': ('duration', 'sum')
})
.to_dict(orient='index')
)
Output:
{'email1': {'num_events_with': 1, 'duration_of_events': 3600},
'email2': {'num_events_with': 2, 'duration_of_events': 10800},
'email3': {'num_events_with': 1, 'duration_of_events': 7200}}
CodePudding user response:
Example
col = ['event', 'attendees', 'duration']
data = [['meeting', [{"id":1, "email": "email1"}, {"id":2, "email": "email2"}], 3600], ['lunch', [{"id":2, "email": "email2"}, {"id":3, "email": "email3"}],7200]]
df = pd.DataFrame(data, columns=col)
Code
df1 = df.explode('attendees')
grouper = df1['attendees'].str['email']
col1 = ['num_events_with', 'duration_of_events']
out = (df1.groupby(grouper)['duration'].agg(['count', sum]).T.set_axis(col1).to_dict())
out
:
{'email1': {'num_events_with': 1, 'duration_of_events': 3600},
'email2': {'num_events_with': 2, 'duration_of_events': 10800},
'email3': {'num_events_with': 1, 'duration_of_events': 7200}}
if you want 1 line use following
(df.explode('attendees').assign(attendees=lambda x:x['attendees'].str['email'])
.groupby('attendees')['duration'].agg(['count',sum])
.T.set_axis(['num_events_with', 'duration_of_events']).to_dict())