I am trying to figure out how to "aggregate" this transposed dataset. I am not sure if aggregate is the right word because the math is happening across rows. I have a dataframe that looks similar to such:
EDIT: There are multiple cases of the same value in "date." The data is transposed to the person ID. There are also Date1-5 columns as well. The date referenced in the below table is the one i ultimately home to aggregate by for the created NRev1-NRev# values.
Date | Return1 | Return2 | Return3 | Return4 | Return5 | Rev1 | Rev2 | Rev3 | Rev4 | Rev5 |
---|---|---|---|---|---|---|---|---|---|---|
2020-1 | 0 | 1 | 2 | 3 | 4 | 100 | 500 | 100 | 200 | 300 |
2020-2 | 5 | 6 | 7 | 8 | nan | 200 | 120 | 100 | 200 | nan |
2020-3 | 2 | 3 | 7 | 9 | nan | 100 | 0 | 100 | 200 | nan |
and am trying to create additional revenue columns based upon their values of return, while adding together the values from rev1-rev5.
The resulting columns would look as follows:
Date | NRev0 | NRev1 | NRev2 | NRev3 | NRev4 | NRev5 | NRev6 | NRev7 | NRev8 | NRev9 |
---|---|---|---|---|---|---|---|---|---|---|
2020-1 | 100 | 500 | 100 | 200 | 300 | 0 | 0 | 0 | 0 | 0 |
2020-2 | 0 | 0 | 0 | 0 | 0 | 200 | 120 | 100 | 200 | 0 |
2020-3 | 0 | 0 | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 200 |
Essentially, what I'm looking to do is to create a new variable "NRev," concatenated based upon the row value of "return." So if return1 = 4, for instance, NRev4 would equal the value of Rev1. The values of returns will change over time, but the number of return columns to revenue columns will always match. So theoretically, if there were a maximum value of 100 across all "Return" columns, the corresponding "Revenue" column would create "NRev100", and be filled with the corresponding revenue value's index.
In SPSS, I am able to create the columns using this code, but is non pythonic, and the number of return and rev columns will increase over time, as well as return values:
if return1=0 NRev0= NRev0 Rev1.
if return1=1 NRev1= NRev1 Rev1.
if return1=2 NRev2= NRev2 Rev1.
if return1=3 NRev3= NRev3 Rev1.
if return1=4 NRev4= NRev4 Rev1.
if return2=0 NRev0= NRev0 Rev2.
if return2=1 NRev1= NRev1 Rev2.
if return2=2 NRev2= NRev2 Rev2.
if return2=3 NRev3= NRev3 Rev2.
if return2=4 NRev4= NRev4 Rev2.
if return3=0 NRev0= NRev0 Rev3.
if return3=1 NRev1= NRev1 Rev3.
if return3=2 NRev2= NRev2 Rev3.
if return3=3 NRev3= NRev3 Rev3.
if return3=4 NRev4= NRev4 Rev3.
CodePudding user response:
We can do some reshaping with pd.wide_to_long
then pivot_table
back to wide format. This allows us to align Return and Rev lines then convert the Return values to the new columns. Some cleanup with add_prefix
and rename_axis
can be done to polish the output:
new_df = (
pd.wide_to_long(df, stubnames=['Return', 'Rev'], i='Date', j='K')
.dropna()
.astype({'Return': int})
.pivot_table(index='Date', columns='Return', values='Rev', fill_value=0)
.add_prefix('NRev')
.rename_axis(columns=None)
.reset_index()
)
new_df
:
Date NRev0 NRev1 NRev2 NRev3 NRev4 NRev5 NRev6 NRev7 NRev8 NRev9
0 2020-1 100 500 100 200 300 0 0 0 0 0
1 2020-2 0 0 0 0 0 200 120 100 200 0
2 2020-3 0 0 100 0 0 0 0 100 0 200
wide_to_long
gives:
Return Rev
Date K
2020-1 1 0.0 100.0 # Corresponding Return index and Rev are in the same row
2020-2 1 5.0 200.0
2020-3 1 2.0 100.0
2020-1 2 1.0 500.0
2020-2 2 6.0 120.0
2020-3 2 3.0 0.0
2020-1 3 2.0 100.0
2020-2 3 7.0 100.0
2020-3 3 7.0 100.0
2020-1 4 3.0 200.0
2020-2 4 8.0 200.0
2020-3 4 9.0 200.0
2020-1 5 4.0 300.0
2020-2 5 NaN NaN
2020-3 5 NaN NaN # These NaN are Not Needed
The Removing NaN step and returning Return to int
(pd.wide_to_long(df, stubnames=['Return', 'Rev'], i='Date', j='K')
.dropna()
.astype({'Return': int}))
Return Rev
Date K
2020-1 1 0 100.0
2020-2 1 5 200.0
2020-3 1 2 100.0
2020-1 2 1 500.0
2020-2 2 6 120.0
2020-3 2 3 0.0
2020-1 3 2 100.0
2020-2 3 7 100.0
2020-3 3 7 100.0
2020-1 4 3 200.0
2020-2 4 8 200.0
2020-3 4 9 200.0
2020-1 5 4 300.0
Then this can easily be moved back to wide with a pivot_table
:
(pd.wide_to_long(df, stubnames=['Return', 'Rev'], i='Date', j='K')
.dropna()
.astype({'Return': int})
.pivot_table(index='Date', columns='Return', values='Rev', fill_value=0))
Return 0 1 2 3 4 5 6 7 8 9
Date
2020-1 100 500 100 200 300 0 0 0 0 0
2020-2 0 0 0 0 0 200 120 100 200 0
2020-3 0 0 100 0 0 0 0 100 0 200
The rest is just cosmetic changes to the DataFrame.
If dates are duplicated wide_to_long
cannot be used, but we can manually reshape the DataFrame to wide with str.extract
then set_index
stack
:
# Set Index Column
new_df = df.set_index('Date')
# Handle MultiIndex Manually
new_df.columns = pd.MultiIndex.from_frame(
new_df.columns.str.extract('(.*)(\d )$')
)
# Stack then the rest is the same
new_df = (
new_df.stack()
.dropna()
.astype({'Return': int})
.pivot_table(index='Date', columns='Return', values='Rev',
fill_value=0, aggfunc='first')
.add_prefix('NRev')
.rename_axis(columns=None)
.reset_index()
)
Sample DF with duplicate dates:
df = pd.DataFrame({'Date': ['2020-1', '2020-2', '2020-2'],
'Return1': [0, 5, 0],
'Return2': [1, 6, 1],
'Return3': [2, 7, 2],
'Return4': [3, 8, 3],
'Return5': [4.0, nan, 4.0],
'Rev1': [100, 200, 100],
'Rev2': [500, 120, 0],
'Rev3': [100, 100, 100],
'Rev4': [200, 200, 200],
'Rev5': [300.0, nan, nan]})
df
Date Return1 Return2 Return3 Return4 Return5 Rev1 Rev2 Rev3 Rev4 Rev5
0 2020-1 0 1 2 3 4.0 100 500 100 200 300.0
1 2020-2 5 6 7 8 NaN 200 120 100 200 NaN
2 2020-2 0 1 2 3 4.0 100 0 100 200 NaN
new_df
Date NRev0 NRev1 NRev2 NRev3 NRev4 NRev5 NRev6 NRev7 NRev8
0 2020-1 100 500 100 200 300 0 0 0 0
1 2020-2 100 0 100 200 0 200 120 100 200