Home > Enterprise >  Optimize building a new array column from other dataframes
Optimize building a new array column from other dataframes

Time:10-19

I have the following 3 dataframes that I am trying to combine:

Leads

                   Id            Name              Title
0  00Q6F00000zEkMXUA0       V.B Swamy  Managing Director
1  00Q6F00000zEkMXUA1    Vandana Suri      Founder & CEO
2  00Q6F00000zEkMXUA2      Jane Smith            Advisor

Campaigns

                    Id                      Name    NumberOfLeads
0   7016F000001Oo2xQAC        Testing Campaign A                1
1   7016F000001bHoHQAU        Testing Campaign B                2

Campaign Members

             CampaignId              LeadId
0    7016F000001Oo2xQAC  00Q6F00000zEkMXUA0
1    7016F000001bHoHQAU  00Q6F00000zEkMXUA0
2    7016F000001bHoHQAU  00Q6F00000zEkMXUA1

The final output I am trying to get to is:

Leads with Campaigns

                   Id            Name              Title                   Campaigns
0  00Q6F00000zEkMXUA0       V.B Swamy  Managing Director      ['Testing Campaign A', 'Testing Campaign B']
1  00Q6F00000zEkMXUA1    Vandana Suri      Founder & CEO      ['Testing Campaign B']
2  00Q6F00000zEkMXUA2      Jane Smith            Advisor      []

The above output is generated by getting the list of Campaign Members (which are Leads) for each Campaign, and then adding that info as a new Campaigns column in the Leads dataframe.

I was able to implement this myself with the following logic, but am running into issues when the Leads and Campaign Members dataframes are very large. My machine runs out of memory half way through processing the data.

# List of all campaign names we process
campaign_keys = []

for index, row in campaigns.iterrows():
    cname = row['Name']
    cid = row['Id']

    # Get members of this campaign
    matching_members = campaign_members[campaign_members['CampaignId'] == cid]
    
    # Create dataframe of campaign member lead ids
    campaign_df = matching_members['LeadId'].to_frame()
    campaign_df = campaign_df.rename(columns={'LeadId': 'Id'})
    campaign_df[cname] = 1

    # Add to array
    campaign_keys.append(cname)

    # Merge the campaign members with the leads df
    leads_df = leads_df.merge(campaign_df, how='left', left_on='Id', right_on='Id')

    
# Get only the columns for the campaigns we loaded
cdf = leads_df[campaign_keys]

# Build campaigns column
lists_entry = cdf.eq(1).apply(lambda x: list(x.index[x]), axis=1)
leads_df['Campaigns'] = lists_entry

# Drop all List columns
leads_df.drop(campaign_keys, axis=1)    

How can I optimize this code to handle larger payloads and more campaigns?

Any help would be greatly appreciated. Thanks!

CodePudding user response:

Try:

  1. merge the campaigns and members frames on the "CampaignId"
  2. groupby and create lists of campaigns for each "LeadId"
  3. merge the leads DataFrame with the newly created frame
campaign_members = campaigns.merge(members, left_on="Id", right_on="CampaignId", how="right")
campaign_lists = campaign_members.groupby("LeadId")["Name"].agg(list)
output = leads.merge(campaign_lists.rename("Campaigns"), left_on="Id", right_index=True, how="left")

>>> output

                    Id          Name                    Title               Campaigns
0   00Q6F00000zEkMXUA0     V.B Swamy        Managing Director       ['Testing Campaign A', 'Testing Campaign B']
1   00Q6F00000zEkMXUA1  Vandana Suri            Founder & CEO       ['Testing Campaign B']
2   00Q6F00000zEkMXUA2    Jane Smith                  Advisor   
  • Related