Improving the efficiency of pandas nested for loops


I have data from a multi-day match in which people collected points on different days. For each person (identified by a unique key), I want to record the points earned on that person's first, second, and third day, and I have two data frames for this (see frames A and B below). The problem is that the code I wrote to obtain the result takes a long time and is very inefficient. I can't figure out how apply/map works when two values have to be matched (not only the person key but also the day). Can anybody help me out with this? I am using pandas in Python.

Example:

Dataframe A

Person key  Start day  Day  Points
       123         10   10       5
       123         10   11       1
       888          1    1       6
       888          1    2       2
       888          1    3       4

Dataframe B

Person key  Day 1  Day 2  Day 3  points day 1  points day 2  points day 3
       123     10     11     12             0             0             0
       888      1      2      3             0             0             0

What I would like to get:

Person key  Day 1  Day 2  Day 3  points day 1  points day 2  points day 3
       123     10     11     12             5             1             0
       888      1      2      3             6             2             4

The code I tried:

for i in range(len(dfB)):
    for t in range(len(dfA)):
        # only rows for the same person can match
        if dfB['Person key'].iloc[i] != dfA['Person key'].iloc[t]:
            continue
        # write via .loc to avoid chained-indexing assignment,
        # which may silently fail to update dfB
        if dfB['Day 1'].iloc[i] == dfA['Day'].iloc[t]:
            dfB.loc[dfB.index[i], 'points day 1'] = dfA['Points'].iloc[t]
        elif dfB['Day 2'].iloc[i] == dfA['Day'].iloc[t]:
            dfB.loc[dfB.index[i], 'points day 2'] = dfA['Points'].iloc[t]
        elif dfB['Day 3'].iloc[i] == dfA['Day'].iloc[t]:
            dfB.loc[dfB.index[i], 'points day 3'] = dfA['Points'].iloc[t]

CodePudding user response:

Try:

  1. melt dfA to match the structure of dfB
  2. merge both DataFrames
  3. pivot and rename columns to get the required format
melted = dfB.melt("Person key", ["Day 1", "Day 2", "Day 3"], value_name="Day")
output = melted.merge(dfA, on=["Person key", "Day"], how="left").drop("Start day", axis=1)
output = output.pivot(index="Person key", columns="variable", values=["Day", "Points"]).fillna(0)
output.columns = output.columns.map(lambda x: x[1] if x[0]=="Day" else " ".join(x))

>>> output
            Day 1  Day 2  Day 3  Points Day 1  Points Day 2  Points Day 3
Person key                                                               
123          10.0   11.0   12.0           5.0           1.0           0.0
888           1.0    2.0    3.0           6.0           2.0           4.0
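For reference, the three steps can be run end to end on the example data; the DataFrames below are reconstructed from the question's tables, and keyword arguments are used for melt/pivot (positional pivot arguments were removed in pandas 2.0):

```python
import pandas as pd

# Data reconstructed from the question's tables
dfA = pd.DataFrame({
    "Person key": [123, 123, 888, 888, 888],
    "Start day": [10, 10, 1, 1, 1],
    "Day": [10, 11, 1, 2, 3],
    "Points": [5, 1, 6, 2, 4],
})
dfB = pd.DataFrame({
    "Person key": [123, 888],
    "Day 1": [10, 1],
    "Day 2": [11, 2],
    "Day 3": [12, 3],
})

# 1. melt dfB so each (person, day slot, day) becomes one row
melted = dfB.melt(id_vars="Person key",
                  value_vars=["Day 1", "Day 2", "Day 3"],
                  value_name="Day")

# 2. left-merge the points from dfA; days with no match get NaN
merged = (melted.merge(dfA, on=["Person key", "Day"], how="left")
                .drop(columns="Start day"))

# 3. pivot back to one row per person and flatten the column MultiIndex
output = merged.pivot(index="Person key", columns="variable",
                      values=["Day", "Points"]).fillna(0)
output.columns = output.columns.map(
    lambda x: x[1] if x[0] == "Day" else " ".join(x))
print(output)
```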

CodePudding user response:

You might want to convert your DataFrames to dictionaries, do your calculations, and convert the result dictionary back to a pandas DataFrame. You can call the to_dict() method of your DataFrame: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html and the from_dict() method to convert a dict back to a pandas DataFrame: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html
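As a sketch of that idea (the variable names are illustrative): build a dict keyed by (person, day) from dfA once, then fill dfB with O(1) lookups instead of scanning all of dfA for every row of dfB:

```python
import pandas as pd

# Data reconstructed from the question's tables
dfA = pd.DataFrame({
    "Person key": [123, 123, 888, 888, 888],
    "Day": [10, 11, 1, 2, 3],
    "Points": [5, 1, 6, 2, 4],
})
dfB = pd.DataFrame({
    "Person key": [123, 888],
    "Day 1": [10, 1],
    "Day 2": [11, 2],
    "Day 3": [12, 3],
})

# One dict lookup per (person, day) replaces the O(len(dfA)) inner loop
points = {(p, d): pts
          for p, d, pts in zip(dfA["Person key"], dfA["Day"], dfA["Points"])}
for n in (1, 2, 3):
    dfB[f"points day {n}"] = [
        points.get((p, d), 0)  # 0 when that person has no points for that day
        for p, d in zip(dfB["Person key"], dfB[f"Day {n}"])
    ]
```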

There are some nice entries on Stack Overflow about the performance of dictionaries:

Is there anything faster than dict()?

Pandas DataFrame performance
