Create new features based on datetime and other columns-CodePudding

Let's say I have a dataset with data from the last 5 months for some people. I want to create some features related to each of those 5 months, like that:

My initial dataframe is looking like that:

Person	date	fruits eaten
Person A	25-01-2015	2
Person B	29-01-2015	3
Person A	13-02-2015	0
Person A	14-05-2015	10
Person A	19-05-2015	11

I want my dataframe to look like that from below: So, creating 5 (or more columns, maybe 12 corresponding to all of the 12 months from 2015 year) related to month and year (grouping by the month and year) and then putting in the corresponding features their count and sum of fruits eaten columns based on the month and year.

Person	01-2015_size	02-2015_size	03-2015_size	04-2015_size	05-2015_size	01-2015_fruits	02-2015_fruits	03-2015_fruits	04-2015_fruits	05-2015_fruits
Person A	2	1	0	0	2	2	0	0	0	21
Person B	1	0	0	0	0	3	0	0	0	0

Could you please suggest me how to do that?

CodePudding user response：

First of all, if the date column is a string, I recommend you to run the code below.

df.date = pd.to_datetime(df.date).dt.strftime('%m_%Y')

Then we run the group by process.

df = df.groupby(by=['Person','date'])['fruits eaten'].sum().reset_index()

Finally, we run the pivot operation.

df = df.pivot(index='Person', columns='date', values='fruits eaten')

I hope your problem will be solved.

CodePudding user response：

You can use groupby_agg to create desired values then pivot the dataframe. Before you have to create your custom index to get expected columns.

# Create your custom index (variables, dates)
dti = pd.date_range(df['date'].min(), df['date'].max()   pd.DateOffset(months=1),
                    freq='M').strftime('%m-%Y')
mi = pd.MultiIndex.from_product([['size', 'fruits'], dti])

# Reformat your dataframe
out = df.assign(date=df['date'].dt.strftime('%m-%Y')) \
        .groupby(['Person', 'date']) \
        .agg(size=('date', 'size'), fruits=('fruits eaten', 'sum')) \
        .unstack('date').reindex(columns=mi).fillna(0).astype(int)

# Rename your columns like 'month-year_variable'
out.columns = out.columns.swaplevel().to_flat_index().str.join('_')

Output:

>>> out
          01-2015_size  02-2015_size  03-2015_size  04-2015_size  05-2015_size  01-2015_fruits  02-2015_fruits  03-2015_fruits  04-2015_fruits  05-2015_fruits
Person                                                                                                                                                        
Person A             1             1             0             0             2               2               0               0               0              21
Person B             1             0             0             0             0               3               0               0               0               0