Home > Enterprise >  Pandas data frame calculate difference from first rows
Pandas data frame calculate difference from first rows

Time:04-20

I have a dataset where I have the results of a model under many different scenarios and I want to capture the difference between each scenario and the baseline.

Example input data:

Scenario ID Value
Baseline 1 10
Baseline 2 5
Baseline 3 1
A 1 10
A 2 0
A 3 5

Desired result:

Scenario ID Difference from Baseline
A 1 0
A 2 -5
A 3 4

I have done this in the past manually in cases where there are 1-2 values which I want to subtract, but now I want to be able to do this for a variable number of columns. I think there will be 20-40 "Value" columns I want to subtract, anywhere from 2-50 unique IDs, and 60 scenarios.

I can think of two approaches to this but both seem inefficient:

  1. Start by separating into "baseline" and "all_scenarios" dfs. Get list of columns. Loop through each column in baseline df to rename. Join dataframes on ID. Loop through column list to perform subtraction.
  2. Start by separating into "baseline" and "all_scenarios" dfs. Get unique list of IDs. Loop through each ID and create a new baseline and all_scenario dfs with only that ID. Use df.subtract() to get the difference. Re-join all ID dataframes back together.

Is there a better way to approach this? How should I think about which version would be more efficient for large datasets?

CodePudding user response:

If possible match by ID filtered rows by condition for compare by Baseline use:

m = df['Scenario'].eq('Baseline')
df1 = df[m].set_index('ID').select_dtypes(np.number)
df2 = df[~m]

df = (df2.join(df2.set_index('ID').select_dtypes(np.number)
                  .sub(df1)
                  .add_prefix('Difference with '), 'ID')
                  .drop(df1.columns, axis=1)
         )
print (df)
  Scenario  ID  Difference with Value
3        A   1                      0
4        A   2                     -5
5        A   3                      4

CodePudding user response:

considering the following dataframe -

>>> df = pd.DataFrame(
    {'col1': ['Baseline','Baseline','Baseline', 'A', 'A','A'],
    'col2': [1,2,3,1,2,3],
    'col3': [10, 5, 1, 10, 0, 5]}
)
>>> print(df)
col1  col2  col3
0  Baseline     1    10
1  Baseline     2     5
2  Baseline     3     1
3         A     1    10
4         A     2     0
5         A     3     5

First you can create a pivot_table with col2

>>> df1 = df.pivot_table(index = 'col1', columns='col2', values='col3')
>>> print(df1)
col2       1  2  3
col1              
A         10  0  5
Baseline  10  5  1

We can transpose df1 -

>>> df2 = df1.transpose()
>>> print(df2)
col1   A  Baseline
col2              
1     10        10
2      0         5
3      5         1

Now add the difference in a new column

>>> df2['diff'] = df2[df2.columns[0]] - df2[df2.columns[1]]
>>> print(df2)

col1   A  Baseline  diff
col2                    
1     10        10     0
2      0         5    -5
3      5         1     4

Combining all of them in one place -

>>> dfPivot = df.pivot_table(index = 'col1', columns='col2', values='col3').transpose()
>>> dfPivot['diff'] = dfPivot[dfPivot.columns[0]] - dfPivot[dfPivot.columns[1]]
  • Related