Let's say I have a dataset with data from the last 5 months for some people. I want to create some features related to each of those 5 months, like that:
My initial dataframe is looking like that:
Person | date | fruits eaten |
---|---|---|
Person A | 25-01-2015 | 2 |
Person B | 29-01-2015 | 3 |
Person A | 13-02-2015 | 0 |
Person A | 14-05-2015 | 10 |
Person A | 19-05-2015 | 11 |
I want my dataframe to look like that from below: So, creating 5 (or more columns, maybe 12 corresponding to all of the 12 months from 2015 year) related to month and year (grouping by the month and year) and then putting in the corresponding features their count and sum of fruits eaten columns based on the month and year.
Person | 01-2015_size | 02-2015_size | 03-2015_size | 04-2015_size | 05-2015_size | 01-2015_fruits | 02-2015_fruits | 03-2015_fruits | 04-2015_fruits | 05-2015_fruits |
---|---|---|---|---|---|---|---|---|---|---|
Person A | 2 | 1 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 21 |
Person B | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
Could you please suggest me how to do that?
CodePudding user response:
First of all, if the date column is a string, I recommend you to run the code below.
df.date = pd.to_datetime(df.date).dt.strftime('%m_%Y')
Then we run the group by process.
df = df.groupby(by=['Person','date'])['fruits eaten'].sum().reset_index()
Finally, we run the pivot operation.
df = df.pivot(index='Person', columns='date', values='fruits eaten')
I hope your problem will be solved.
CodePudding user response:
You can use groupby_agg
to create desired values then pivot
the dataframe. Before you have to create your custom index to get expected columns.
# Create your custom index (variables, dates)
dti = pd.date_range(df['date'].min(), df['date'].max() pd.DateOffset(months=1),
freq='M').strftime('%m-%Y')
mi = pd.MultiIndex.from_product([['size', 'fruits'], dti])
# Reformat your dataframe
out = df.assign(date=df['date'].dt.strftime('%m-%Y')) \
.groupby(['Person', 'date']) \
.agg(size=('date', 'size'), fruits=('fruits eaten', 'sum')) \
.unstack('date').reindex(columns=mi).fillna(0).astype(int)
# Rename your columns like 'month-year_variable'
out.columns = out.columns.swaplevel().to_flat_index().str.join('_')
Output:
>>> out
01-2015_size 02-2015_size 03-2015_size 04-2015_size 05-2015_size 01-2015_fruits 02-2015_fruits 03-2015_fruits 04-2015_fruits 05-2015_fruits
Person
Person A 1 1 0 0 2 2 0 0 0 21
Person B 1 0 0 0 0 3 0 0 0 0