I have the following 3 dataframes that I am trying to combine:
Leads
Id Name Title
0 00Q6F00000zEkMXUA0 V.B Swamy Managing Director
1 00Q6F00000zEkMXUA1 Vandana Suri Founder & CEO
2 00Q6F00000zEkMXUA2 Jane Smith Advisor
Campaigns
Id Name NumberOfLeads
0 7016F000001Oo2xQAC Testing Campaign A 1
1 7016F000001bHoHQAU Testing Campaign B 2
Campaign Members
CampaignId LeadId
0 7016F000001Oo2xQAC 00Q6F00000zEkMXUA0
1 7016F000001bHoHQAU 00Q6F00000zEkMXUA0
2 7016F000001bHoHQAU 00Q6F00000zEkMXUA1
The final output I am trying to get to is:
Leads with Campaigns
Id Name Title Campaigns
0 00Q6F00000zEkMXUA0 V.B Swamy Managing Director ['Testing Campaign A', 'Testing Campaign B']
1 00Q6F00000zEkMXUA1 Vandana Suri Founder & CEO ['Testing Campaign B']
2 00Q6F00000zEkMXUA2 Jane Smith Advisor []
The above output is generated by getting the list of Campaign Members (which are Leads) for each Campaign, and then adding that info as a new Campaigns column in the Leads dataframe.
I was able to implement this myself with the following logic, but am running into issues when the Leads and Campaign Members dataframes are very large. My machine runs out of memory half way through processing the data.
# List of all campaign names we process
campaign_keys = []
for index, row in campaigns.iterrows():
cname = row['Name']
cid = row['Id']
# Get members of this campaign
matching_members = campaign_members[campaign_members['CampaignId'] == cid]
# Create dataframe of campaign member lead ids
campaign_df = matching_members['LeadId'].to_frame()
campaign_df = campaign_df.rename(columns={'LeadId': 'Id'})
campaign_df[cname] = 1
# Add to array
campaign_keys.append(cname)
# Merge the campaign members with the leads df
leads_df = leads_df.merge(campaign_df, how='left', left_on='Id', right_on='Id')
# Get only the columns for the campaigns we loaded
cdf = leads_df[campaign_keys]
# Build campaigns column
lists_entry = cdf.eq(1).apply(lambda x: list(x.index[x]), axis=1)
leads_df['Campaigns'] = lists_entry
# Drop all List columns
leads_df.drop(campaign_keys, axis=1)
How can I optimize this code to handle larger payloads and more campaigns?
Any help would be greatly appreciated. Thanks!
CodePudding user response:
Try:
merge
the campaigns and members frames on the "CampaignId"groupby
and create lists of campaigns for each "LeadId"merge
the leads DataFrame with the newly created frame
campaign_members = campaigns.merge(members, left_on="Id", right_on="CampaignId", how="right")
campaign_lists = campaign_members.groupby("LeadId")["Name"].agg(list)
output = leads.merge(campaign_lists.rename("Campaigns"), left_on="Id", right_index=True, how="left")
>>> output
Id Name Title Campaigns
0 00Q6F00000zEkMXUA0 V.B Swamy Managing Director ['Testing Campaign A', 'Testing Campaign B']
1 00Q6F00000zEkMXUA1 Vandana Suri Founder & CEO ['Testing Campaign B']
2 00Q6F00000zEkMXUA2 Jane Smith Advisor