I have the following DataFrame
df = pd.DataFrame(
{
"date": ["2022-03-01 10:00:00", "2022-03-01 10:00:00", "2022-03-01 10:00:00", "2022-03-01 10:00:00", "2022-03-01 12:00:00"],
"plant_type": [1, 1, 1, 2, 3],
"cultivation_table": [1, 1, 1, 1, 2],
"farmer": [1, 1, 1, 1, 2],
"activity": ["water", "germinate", "compost", "water", "germinate"],
"duration": ["20s", "45s", "18s", "10min", "13min"],
"in_time": [0, 1, 1, 0, 1],
"finished": [1, 1, 1, 1, 0],
}
)
date plant_type cultivation_table farmer activity duration in_time finished
2022-03-01 10:00:00 1 1 1 water 20s 0 1
2022-03-01 10:00:00 1 1 1 germinate 45s 1 1
2022-03-01 10:00:00 1 1 1 compost 18s 1 1
2022-03-01 10:00:00 2 1 1 water 10min 0 1
2022-03-01 12:00:00 3 2 2 germinate 13min 1 0
I need to group by date, plant_type, cultivation_table, farmer, and keep activity, duration, in_time and finished in columns. I need to get a table like the following:
date plant_type cultivation_table farmer water water_in_time water_finished germinate germinate_in_time germinate_finished compost germinate_in_time germinate_finished
2022-03-01 10:00:00 1 1 1 20s 0 1 45s 1 1 18s 1 1
2022-03-01 10:00:00 2 1 1 10m 0 1 0s 0 0 0s 0 0
2022-03-01 12:00:00 3 2 2 0s 0 1 13min 1 0 0s 0 0
I was testing pivot and managed to get the following result:
date plant_type cultivation_table farmer activity compost germinate water
2022-03-01 10:00:00 1 1 1 water 18s 45s 20s
2022-03-01 10:00:00 2 1 1 water 0 0 10m
2022-03-01 12:00:00 3 2 2 germinate 0 13min 0
This is the code:
(df.groupby(["date", 'plant_type', 'cultivation_table', 'farmer'])['activity'].first().reset_index()
.merge(df.pivot(['date', 'plant_type', 'cultivation_table', 'farmer'], 'activity', 'duration')
.fillna(0).reset_index(), on=["date", 'plant_type', 'cultivation_table', 'farmer']))
CodePudding user response:
IIUC, pivot
suffices. The rest is a matter of how to fill the missing values:
out = df.pivot(['date', 'plant_type', 'cultivation_table', 'farmer'],
'activity',
['duration', 'in_time', 'finished'])
out['duration'] = out['duration'].fillna('0s')
out.loc[:, ['in_time','finished']] = out[['in_time','finished']].fillna(0)
out.columns = [y if x=='duration' else f'{y}_{x}' for x,y in out.columns]
out = out.sort_index(axis=1, ascending=False).reset_index()
Output:
date plant_type cultivation_table farmer water_in_time water_finished water germinate_in_time germinate_finished germinate compost_in_time compost_finished compost
0 2022-03-01 10:00:00 1 1 1 0 1 20s 1 1 45s 1 1 18s
1 2022-03-01 10:00:00 2 1 1 0 1 10min 0 0 0s 0 0 0s
2 2022-03-01 12:00:00 3 2 2 0 0 0s 1 0 13min 0 0 0s