I have a dataset where I have the results of a model under many different scenarios and I want to capture the difference between each scenario and the baseline.
Example input data:
Scenario | ID | Value |
---|---|---|
Baseline | 1 | 10 |
Baseline | 2 | 5 |
Baseline | 3 | 1 |
A | 1 | 10 |
A | 2 | 0 |
A | 3 | 5 |
Desired result:
Scenario | ID | Difference from Baseline |
---|---|---|
A | 1 | 0 |
A | 2 | -5 |
A | 3 | 4 |
I have done this in the past manually in cases where there are 1-2 values which I want to subtract, but now I want to be able to do this for a variable number of columns. I think there will be 20-40 "Value" columns I want to subtract, anywhere from 2-50 unique IDs, and 60 scenarios.
I can think of two approaches to this but both seem inefficient:
- Start by separating into "baseline" and "all_scenarios" dfs. Get list of columns. Loop through each column in baseline df to rename. Join dataframes on ID. Loop through column list to perform subtraction.
- Start by separating into "baseline" and "all_scenarios" dfs. Get unique list of IDs. Loop through each ID and create a new baseline and all_scenario dfs with only that ID. Use df.subtract() to get the difference. Re-join all ID dataframes back together.
Is there a better way to approach this? How should I think about which version would be more efficient for large datasets?
CodePudding user response:
If possible match by ID
filtered rows by condition for compare by Baseline
use:
m = df['Scenario'].eq('Baseline')
df1 = df[m].set_index('ID').select_dtypes(np.number)
df2 = df[~m]
df = (df2.join(df2.set_index('ID').select_dtypes(np.number)
.sub(df1)
.add_prefix('Difference with '), 'ID')
.drop(df1.columns, axis=1)
)
print (df)
Scenario ID Difference with Value
3 A 1 0
4 A 2 -5
5 A 3 4
CodePudding user response:
considering the following dataframe -
>>> df = pd.DataFrame(
{'col1': ['Baseline','Baseline','Baseline', 'A', 'A','A'],
'col2': [1,2,3,1,2,3],
'col3': [10, 5, 1, 10, 0, 5]}
)
>>> print(df)
col1 col2 col3
0 Baseline 1 10
1 Baseline 2 5
2 Baseline 3 1
3 A 1 10
4 A 2 0
5 A 3 5
First you can create a pivot_table
with col2
>>> df1 = df.pivot_table(index = 'col1', columns='col2', values='col3')
>>> print(df1)
col2 1 2 3
col1
A 10 0 5
Baseline 10 5 1
We can transpose df1
-
>>> df2 = df1.transpose()
>>> print(df2)
col1 A Baseline
col2
1 10 10
2 0 5
3 5 1
Now add the difference in a new column
>>> df2['diff'] = df2[df2.columns[0]] - df2[df2.columns[1]]
>>> print(df2)
col1 A Baseline diff
col2
1 10 10 0
2 0 5 -5
3 5 1 4
Combining all of them in one place -
>>> dfPivot = df.pivot_table(index = 'col1', columns='col2', values='col3').transpose()
>>> dfPivot['diff'] = dfPivot[dfPivot.columns[0]] - dfPivot[dfPivot.columns[1]]