Increase Speed of Nested For Loops While Changing Value of a DataFrame-CodePudding

I'm looking to increase the speed of the nested for loops.

VARIABLES:

'dataframe' - The dataframe I am attempting to modify in the second for loop. It consists of a multitude of training sessions for the same people. This is the attendance document that is changed if a match exists in the reporting dataframe.

'dictNewNames' - This is a dictionary of session title names. The key is the longer session title name and the value is a stripped session title name. For example {'Week 1: Training': 'Training'} etc. The key is equal to the 'Session Title' column in each row but the value is used for searching a substring in the second for loop.

'reporting' - A dataframe that includes information regarding session titles and attendance participation. The reporting dataframe is already filtered so everyone in the 'reporting' dataframe should get credit in 'dataframe'. The only caveat is that the 'search' name is nested within the pathway title.

dataframe = {
'Session Title': ['Week 1: Train 1', 'Week 2: Train 2', 'Week 3: Train 3'],
'Attendee Email': ['[email protected]', '[email protected]', '[email protected]'],
'Completed': ['No', 'No', 'No'],
'Date Completed': ['','','']}

dictNewNames = { 'Week 1: Train 1': 'Train 1', 'Week 2: Train 2': 'Train 2', 'Week 3: Train 3': 'Train 3' }

reporting = {
'Pathway Title': ['One: Train 1', 'Two: Train 2', 'Three: Train 3'],
'Email': ['[email protected]', '[email protected]', '[email protected]'],
'Date Completed': ['xx/yy/xx', 'yy/xx/zz', 'zz/xx/yy']}

My code:

def giveCredit(dataframe, dictNewNames, reporting):

    for index, row in dataframe.iterrows():


        temp = row['Session Title']
        searchName = dictNewNames[temp]

        attendeeEmail = row['Attendee: Email']


        for index1, row1 in reporting.iterrows():

            pathwayTitle = row1['Pathway Title']
            Email = row1['Organization Email']
            dateCompleted = row1['Date Completed']

            if attendeeEmail == Email and searchName in pathwayTitle:
                dataframe.at[index, 'Completed'] = 'Yes'
                dataframe.at[index, 'Date Completed'] = dateCompleted
                break

    return dataframe

CodePudding user response：

Your pattern looks like merge:

for loop1 on first dataframe:
   for loop2 on second dataframe:
       if conditions match between both dataframes:

So:

# Create a common key Name based on dictNewNames
pat = fr"({'|'.join(dictNewNames.values())})"
name1 = dataframe['Session Title'].map(dictNewNames)
name2 = reporting['Pathway Title'].str.extract(pat)

# Merge dataframes based on this key
out = pd.merge(dataframe.assign(Name=name1),
               reporting.assign(Name=name2),
               how='left', on='Name', suffixes=(None, '_'))

# Update the dataframe
out['Date Completed'] = out.pop('Date Completed_')
out['Completed'] = np.where(out['Date Completed'].notna(), 'Yes', 'No')
out = out[dataframe.columns]

Output:

>>> out
     Session Title   Attendee Email Completed Date Completed
0  Week 1: Train 1   [email protected]       Yes       xx/yy/xx
1  Week 2: Train 2  [email protected]       Yes       yy/xx/zz
2  Week 3: Train 3  [email protected]       Yes       zz/xx/yy

CodePudding user response：

This workaround cut my execution time from 460 seconds to under 12.

def giveCredit(dataframe, dictNewNames, reporting):

    
    for index1, row in dataframe.iterrows():

        temp = row['Session Title']
        searchName = dictNewNames[temp]

        attendeeEmail = row['Attendee: Email']

        row = reporting.loc[(reporting['Pathway Title'].str.contains(searchName, case=False)) & (reporting['Email'] == attendeeEmail)]

        if len(row.index) != 0:

            dateCompleted = row['Date Completed']
            dateCompleted = dateCompleted.to_string(buf=None, header=False, index=False, length=False, name=False, max_rows=None)

            dataframe.at[index1, 'Completed'] = 'Yes'
            dataframe.at[index1, 'Date Completed'] = dateCompleted

    return dataframe