Home > database >  How to aggregate if a condition is met in a transposed dataset?
How to aggregate if a condition is met in a transposed dataset?

Time:09-21

I am trying to figure out how to "aggregate" this transposed dataset. I am not sure if aggregate is the right word because the math is happening across rows. I have a dataframe that looks similar to such:

EDIT: There are multiple cases of the same value in "date." The data is transposed to the person ID. There are also Date1-5 columns as well. The date referenced in the below table is the one i ultimately home to aggregate by for the created NRev1-NRev# values.

Date Return1 Return2 Return3 Return4 Return5 Rev1 Rev2 Rev3 Rev4 Rev5
2020-1 0 1 2 3 4 100 500 100 200 300
2020-2 5 6 7 8 nan 200 120 100 200 nan
2020-3 2 3 7 9 nan 100 0 100 200 nan

and am trying to create additional revenue columns based upon their values of return, while adding together the values from rev1-rev5.

The resulting columns would look as follows:

Date NRev0 NRev1 NRev2 NRev3 NRev4 NRev5 NRev6 NRev7 NRev8 NRev9
2020-1 100 500 100 200 300 0 0 0 0 0
2020-2 0 0 0 0 0 200 120 100 200 0
2020-3 0 0 100 0 0 0 0 100 0 200

Essentially, what I'm looking to do is to create a new variable "NRev," concatenated based upon the row value of "return." So if return1 = 4, for instance, NRev4 would equal the value of Rev1. The values of returns will change over time, but the number of return columns to revenue columns will always match. So theoretically, if there were a maximum value of 100 across all "Return" columns, the corresponding "Revenue" column would create "NRev100", and be filled with the corresponding revenue value's index.

In SPSS, I am able to create the columns using this code, but is non pythonic, and the number of return and rev columns will increase over time, as well as return values:

if return1=0 NRev0= NRev0 Rev1.
if return1=1 NRev1= NRev1 Rev1.
if return1=2 NRev2= NRev2 Rev1.
if return1=3 NRev3= NRev3 Rev1.
if return1=4 NRev4= NRev4 Rev1.

if return2=0 NRev0= NRev0 Rev2.
if return2=1 NRev1= NRev1 Rev2.
if return2=2 NRev2= NRev2 Rev2.
if return2=3 NRev3= NRev3 Rev2.
if return2=4 NRev4= NRev4 Rev2.

if return3=0 NRev0= NRev0 Rev3.
if return3=1 NRev1= NRev1 Rev3.
if return3=2 NRev2= NRev2 Rev3.
if return3=3 NRev3= NRev3 Rev3.
if return3=4 NRev4= NRev4 Rev3.

CodePudding user response:

We can do some reshaping with pd.wide_to_long then pivot_table back to wide format. This allows us to align Return and Rev lines then convert the Return values to the new columns. Some cleanup with add_prefix and rename_axis can be done to polish the output:

new_df = (
    pd.wide_to_long(df, stubnames=['Return', 'Rev'], i='Date', j='K')
        .dropna()
        .astype({'Return': int})
        .pivot_table(index='Date', columns='Return', values='Rev', fill_value=0)
        .add_prefix('NRev')
        .rename_axis(columns=None)
        .reset_index()
)

new_df:

     Date  NRev0  NRev1  NRev2  NRev3  NRev4  NRev5  NRev6  NRev7  NRev8  NRev9
0  2020-1    100    500    100    200    300      0      0      0      0      0
1  2020-2      0      0      0      0      0    200    120    100    200      0
2  2020-3      0      0    100      0      0      0      0    100      0    200

wide_to_long gives:

          Return    Rev
Date   K               
2020-1 1     0.0  100.0  # Corresponding Return index and Rev are in the same row
2020-2 1     5.0  200.0
2020-3 1     2.0  100.0
2020-1 2     1.0  500.0
2020-2 2     6.0  120.0
2020-3 2     3.0    0.0
2020-1 3     2.0  100.0
2020-2 3     7.0  100.0
2020-3 3     7.0  100.0
2020-1 4     3.0  200.0
2020-2 4     8.0  200.0
2020-3 4     9.0  200.0
2020-1 5     4.0  300.0
2020-2 5     NaN    NaN
2020-3 5     NaN    NaN  # These NaN are Not Needed

The Removing NaN step and returning Return to int

(pd.wide_to_long(df, stubnames=['Return', 'Rev'], i='Date', j='K')
 .dropna()
 .astype({'Return': int}))
          Return    Rev
Date   K               
2020-1 1       0  100.0
2020-2 1       5  200.0
2020-3 1       2  100.0
2020-1 2       1  500.0
2020-2 2       6  120.0
2020-3 2       3    0.0
2020-1 3       2  100.0
2020-2 3       7  100.0
2020-3 3       7  100.0
2020-1 4       3  200.0
2020-2 4       8  200.0
2020-3 4       9  200.0
2020-1 5       4  300.0

Then this can easily be moved back to wide with a pivot_table:

(pd.wide_to_long(df, stubnames=['Return', 'Rev'], i='Date', j='K')
 .dropna()
 .astype({'Return': int})
 .pivot_table(index='Date', columns='Return', values='Rev', fill_value=0))
Return    0    1    2    3    4    5    6    7    8    9
Date                                                    
2020-1  100  500  100  200  300    0    0    0    0    0
2020-2    0    0    0    0    0  200  120  100  200    0
2020-3    0    0  100    0    0    0    0  100    0  200

The rest is just cosmetic changes to the DataFrame.


If dates are duplicated wide_to_long cannot be used, but we can manually reshape the DataFrame to wide with str.extract then set_index stack:

# Set Index Column
new_df = df.set_index('Date')
# Handle MultiIndex Manually
new_df.columns = pd.MultiIndex.from_frame(
    new_df.columns.str.extract('(.*)(\d )$')
)
# Stack then the rest is the same
new_df = (
    new_df.stack()
        .dropna()
        .astype({'Return': int})
        .pivot_table(index='Date', columns='Return', values='Rev',
                     fill_value=0, aggfunc='first')
        .add_prefix('NRev')
        .rename_axis(columns=None)
        .reset_index()
)

Sample DF with duplicate dates:

df = pd.DataFrame({'Date': ['2020-1', '2020-2', '2020-2'],
                   'Return1': [0, 5, 0],
                   'Return2': [1, 6, 1],
                   'Return3': [2, 7, 2],
                   'Return4': [3, 8, 3],
                   'Return5': [4.0, nan, 4.0],
                   'Rev1': [100, 200, 100],
                   'Rev2': [500, 120, 0],
                   'Rev3': [100, 100, 100],
                   'Rev4': [200, 200, 200],
                   'Rev5': [300.0, nan, nan]})

df

     Date  Return1  Return2  Return3  Return4  Return5  Rev1  Rev2  Rev3  Rev4   Rev5
0  2020-1        0        1        2        3      4.0   100   500   100   200  300.0
1  2020-2        5        6        7        8      NaN   200   120   100   200    NaN
2  2020-2        0        1        2        3      4.0   100     0   100   200    NaN

new_df

     Date  NRev0  NRev1  NRev2  NRev3  NRev4  NRev5  NRev6  NRev7  NRev8
0  2020-1    100    500    100    200    300      0      0      0      0
1  2020-2    100      0    100    200      0    200    120    100    200
  • Related