I have a question about the following data on a multiple-day match where people collected points on different days of the match. I want to sum the points gotten by one person (with a unique key) on that person's first, second, and third day and I have two data frames for this (see frames A and B). The problem is that the code I wrote to obtain the result, takes a long time and is very inefficient. I can't seem to figure out how to apply/map works in the case where two values have to be matched (not only person key but also the day). Can anybody help me out with this? I am using Pandas in Python.
Example:
Dataframe A
Person key | Start day | Day | Points |
---|---|---|---|
123 | 10 | 10 | 5 |
123 | 10 | 11 | 1 |
888 | 1 | 1 | 6 |
888 | 1 | 2 | 2 |
888 | 1 | 3 | 4 |
Dataframe B
Person key | Day 1 | Day 2 | Day 3 | points day 1 | points day 2 | points day 3 |
---|---|---|---|---|---|---|
123 | 10 | 11 | 12 | 0 | 0 | 0 |
888 | 1 | 2 | 3 | 0 | 0 | 0 |
What I would like to get:
Person key | Day 1 | Day 2 | Day 3 | points day 1 | points day 2 | points day 3 |
---|---|---|---|---|---|---|
123 | 10 | 11 | 12 | 5 | 1 | 0 |
888 | 1 | 2 | 3 | 6 | 2 | 4 |
The code I tried:
for i in range (0,len(dfB)):
for t in range(0,len(dfA)):
if dfB['Day 1'].iloc[i] == dfA['Day'].iloc[t] and dfB['Person key'].iloc[i] == dfA['Person key'].iloc[t]:
dfB['points day 1'].iloc[i] = dfA['Points'].iloc[t]
continue
if dfB['Day 2'].iloc[i] == dfA['Day'].iloc[t] and dfB['Person key'].iloc[i] == dfA['Person key'].iloc[t]:
dfB['points day 2'].iloc[i] = dfA['Points'].iloc[t]
continue
if dfB['Day 3'].iloc[i] == dfA['Day'].iloc[t] and dfB['Person key'].iloc[i] == dfA['Person key'].iloc[t]:
dfB['points day 3'].iloc[i] = dfA['Points'].iloc[t]
continue
CodePudding user response:
Try:
melt
dfA
to match the structure ofdfB
merge
both DataFramespivot
and rename columns to get the required format
melted = dfB.melt("Person key", ["Day 1", "Day 2", "Day 3"], value_name="Day")
output = melted.merge(dfA, on=["Person key", "Day"], how="left").drop("Start day", axis=1)
output = output.pivot("Person key", "variable", ["Day", "Points"]).fillna(0)
output.columns = output.columns.map(lambda x: x[1] if x[0]=="Day" else " ".join(x))
>>> output
Day 1 Day 2 Day 3 Points Day 1 Points Day 2 Points Day 3
Person key
123 10.0 11.0 12.0 5.0 1.0 0.0
888 1.0 2.0 3.0 6.0 2.0 4.0
CodePudding user response:
You might wanna convert your DataFrames to dictionarys, do your calculations and convert the reuslt dictionary back to a pandas DataFrame. You can just call the to_dict() function of your DataFrame: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html and the from_dict() function to convert a dict to a pandas DataFrame: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html
There are some nice entrys here on stackoverflow about the performance of dictionarys: