Home > database >  Create new features based on datetime and other columns
Create new features based on datetime and other columns

Time:11-01

Let's say I have a dataset with data from the last 5 months for some people. I want to create some features related to each of those 5 months, like that:

My initial dataframe is looking like that:

Person date fruits eaten
Person A 25-01-2015 2
Person B 29-01-2015 3
Person A 13-02-2015 0
Person A 14-05-2015 10
Person A 19-05-2015 11

I want my dataframe to look like that from below: So, creating 5 (or more columns, maybe 12 corresponding to all of the 12 months from 2015 year) related to month and year (grouping by the month and year) and then putting in the corresponding features their count and sum of fruits eaten columns based on the month and year.

Person 01-2015_size 02-2015_size 03-2015_size 04-2015_size 05-2015_size 01-2015_fruits 02-2015_fruits 03-2015_fruits 04-2015_fruits 05-2015_fruits
Person A 2 1 0 0 2 2 0 0 0 21
Person B 1 0 0 0 0 3 0 0 0 0

Could you please suggest me how to do that?

CodePudding user response:

First of all, if the date column is a string, I recommend you to run the code below.

df.date = pd.to_datetime(df.date).dt.strftime('%m_%Y')

Then we run the group by process.

df = df.groupby(by=['Person','date'])['fruits eaten'].sum().reset_index()

Finally, we run the pivot operation.

df = df.pivot(index='Person', columns='date', values='fruits eaten')

I hope your problem will be solved.

CodePudding user response:

You can use groupby_agg to create desired values then pivot the dataframe. Before you have to create your custom index to get expected columns.

# Create your custom index (variables, dates)
dti = pd.date_range(df['date'].min(), df['date'].max()   pd.DateOffset(months=1),
                    freq='M').strftime('%m-%Y')
mi = pd.MultiIndex.from_product([['size', 'fruits'], dti])

# Reformat your dataframe
out = df.assign(date=df['date'].dt.strftime('%m-%Y')) \
        .groupby(['Person', 'date']) \
        .agg(size=('date', 'size'), fruits=('fruits eaten', 'sum')) \
        .unstack('date').reindex(columns=mi).fillna(0).astype(int)

# Rename your columns like 'month-year_variable'
out.columns = out.columns.swaplevel().to_flat_index().str.join('_')

Output:

>>> out
          01-2015_size  02-2015_size  03-2015_size  04-2015_size  05-2015_size  01-2015_fruits  02-2015_fruits  03-2015_fruits  04-2015_fruits  05-2015_fruits
Person                                                                                                                                                        
Person A             1             1             0             0             2               2               0               0               0              21
Person B             1             0             0             0             0               3               0               0               0               0
  • Related