I have a dataframe with subjects in two different conditions and many value columns.
d = {
"subject": [1, 1, 2, 2],
"condition": ["on", "off", "on", "off"],
"value": [1, 2, 3, 5]
}
df = pd.DataFrame(data=d)
df
subject | condition | value | |
---|---|---|---|
0 | 1 | on | 1 |
1 | 1 | off | 2 |
2 | 2 | on | 3 |
3 | 2 | off | 5 |
I would like to get new columns which indicate the difference off-on between both conditions. In this case I would like to get:
subject | condition | value | off-on | |
---|---|---|---|---|
0 | 1 | on | 1 | 1 |
1 | 1 | off | 2 | 1 |
2 | 2 | on | 3 | 2 |
3 | 2 | off | 5 | 2 |
How would I best do that?
I could achieve the result using this code:
onoff = (df[df.condition == "off"].value.reset_index() - df[df.condition == "on"].value.reset_index()).value
for idx, sub in enumerate(df.subject.unique()):
df.loc[df.subject == sub, "off-on"] = onoff.iloc[idx]
But it seems quite tedious and slow. I was hoping for a solution without loop. I have many rows and very many value columns. Is there a better way?
CodePudding user response:
Use a pivot
combined with map
:
df['off-on'] = df['subject'].map(
df.pivot(index='subject', columns='condition', values='value')
.eval('off-on')
)
Or with a MultiIndex (more efficient than a pivot
):
s = df.set_index(['condition', 'subject'])['value']
df['off-on'] = df['subject'].map(s['off']-s['on'])
Output:
subject condition value off-on
0 1 on 1 1
1 1 off 2 1
2 2 on 3 2
3 2 off 5 2
timings
On 100k subjects
# MultiIndexing
43.2 ms ± 2.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# pivot
77 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
CodePudding user response:
Use DataFrame.pivot
for possible easy mapping subtracted column off
and on
by Series.map
:
df1 = df.pivot(index='subject', columns='condition', values='value')
df['off-on'] = df['subject'].map(df1['off'].sub(df1['on']))
print (df)
subject condition value off-on
0 1 on 1 1
1 1 off 2 1
2 2 on 3 2
3 2 off 5 2
Details:
print (df.pivot(index='subject', columns='condition', values='value'))
condition off on
subject
1 2 1
2 5 3
print (df1['off'].sub(df1['on']))
subject
1 1
2 2
dtype: int64