I have a table with the following structure in pandas:
import pandas as pd
df = pd.DataFrame({
"user_id": [1, 2, 3],
"user_events": [1, 2, 3],
"user_successes": [0, 1, 2]
})
I would like to put it in a long format. In this case, we have, for each user, a different number of events, and successes. I would like to transform this into an event table (each row corresponds to an event, and there is a column that tells you whether it was a success or not). In this case, the output dataframe should be:
out_df = pd.DataFrame({
"user_id": [1, 2, 2, 3, 3, 3],
"success": [0, 1, 0, 1, 1, 0]
})
What's the simplest way of doing this in pandas? I would like to avoid for loops that iterate on each user, create dataframes and then append them.
CodePudding user response:
You may try with reindex
with repeat
then assign the value with limit condition created by groupby
cumsum
s = df.reindex(df.index.repeat(df.user_events))
s['success'] = 1
s['success'] = s['success'].where(s.groupby('user_id')['success'].cumsum()<=s['user_successes'],0)
s
Out[54]:
user_id user_events user_successes success
0 1 1 0 0
1 2 2 1 1
1 2 2 1 0
2 3 3 2 1
2 3 3 2 1
2 3 3 2 0
CodePudding user response:
One option with reindexing:
df2 = df.set_index('user_id')
out = (df
[['user_id']]
.loc[df.index.repeat(df2['user_events'])]
.assign(success=lambda d: d.groupby('user_id').cumcount()
.lt(d['user_id'].map(df2['user_successes']))
.astype(int)
)
)
output:
user_id success
0 1 0
1 2 1
1 2 0
2 3 1
2 3 1
2 3 0