Speed up operations over Python Pandas dataframes-CodePudding

I would like to speed up a loop over a python Pandas Dataframe. Unfortunately, decades of using low-level languages mean I often struggle to find prepackaged solutions. Note: data is private, but I will see if I can fabricate something and add it into an edit if it helps.

The code has three pandas dataframes: drugUseDF, tempDF, which holds the data, and tempDrugUse, which stores what's been retrieved. I look over every row of tempDF (there will be several million rows), retrieving the prodcode identified from each row and then using that to retrieve the corresponding value from use1 column in the drugUseDF. I've added comments to help navigate.

This is the structure of the dataframes:

tempDF

   patid   eventdate consid prodcode issueseq
0  20001  21/04/2005   2728       85        0
1  25001  21/10/2000   3939       40        0
2  25001  21/02/2001   3950       37        0

drugUseDF

   index prodcode  ...      use1 use2
0    171      479  ...  diabetes  NaN
1    172     9105  ...  diabetes  NaN
2    173     5174  ...  diabetes  NaN

tempDrugUse

  use1
0  NaN
1  NaN
2  NaN

This is the code:

dfList = []                

# if the drug dataframe contains the use1 column. Can this be improved?
if sum(drugUseDF.columns.isin(["use1"])) == 1:
         
    #predine dataframe where we will store the results to be the same length as the main data dataframe.     
    tempDrugUse = DataFrame(data=None, index=range(len(tempDF.index)), dtype=np.str, columns=["use1"])    

    #go through each row of the main data dataframe.
    for ind in range(len(tempDF)): 

        #retrieve the prodcode from the *ind* row of the main data dataframe
        prodcodeStr = tempDF.iloc[ind]["prodcode"]

        #get the corresponding value from the use1 column matching the prodcode column 
        useStr = drugUseDF[drugUseDF.loc[:, "prodcode"] == prodcodeStr]["use1"].values[0]

        #update the storing dataframe
        tempDrugUse.iloc[ind]["use1"] = useStr

    print("[DEBUG] End of loop for use1")
    dfList.append(tempDrugUse)

The order of the data matters. I can't retrieve multiple rows by matching the prodcode because each row has a date column. Retrieving multiple rows and adding them to the tempDrugUse dataframe could mean that the rows are no longer in chronological date order.

CodePudding user response：

When trying to combine data in two dataframes you should use the merge (similar to JOIN in sql-like languages). Performance wise, you should never loop over the rows - you should use the pandas built-in methods whenever possible. Ordering can be achieved with the sort_values method.

CodePudding user response：

If I understand you correctly, you want to map the prodcode from both tables. You can do this via pd.merge (please note the example in the code below differs from your data):

tempDF = pd.DataFrame({'patid': [20001, 25001, 25001],
                       'prodcode': [101,102,103]})
drugUseDF = pd.DataFrame({'prodcode': [101,102,103],
                          'use1': ['diabetes', 'hypertonia', 'gout']})
merged_df = pd.merge(tempDF, drugUseDF, on='prodcode', how='left')