I have a sample dataframe taken from my huge dataframe, as shown below.
import pandas as pd
import numpy as np
NaN = np.nan
data = {
'ID':['AAQRB','AAQRB','AAQRB',
'AHXSJ','AHXSJ','AHXSJ','GABOY','GABOY','GABOY','GHZGS','GHZGS','GHZGS'],
'Date':['10/18/2021 10:52:53 PM','10/18/2021 10:53:55 PM', '10/25/2021 5:55:43 PM',
'10/22/2021 10:37:06 PM','10/22/2021 10:38:22 PM','10/22/2021 10:39:56 PM',
'11/1/2021 1:27:15 AM','11/1/2021 1:28:45 AM','11/2/2021 8:53:39 PM',
'10/29/2021 11:13:57 PM', '10/29/2021 11:17:47 PM', '10/29/2021 11:19:15 PM'],
'Race_x':[NaN,NaN,NaN,NaN,NaN,1,NaN,NaN,1, NaN,NaN,1],
'Vaccine':['TRUE',NaN,NaN,'TRUE',NaN,NaN,'TRUE',NaN,NaN,'FALSE',NaN,NaN],
'Study_activity':
[NaN,'continue',NaN,NaN,'continue',NaN,NaN,'continue',NaN,NaN,'continue',NaN],
'Who_Contacted':
[NaN,NaN,'WeContacted',NaN,NaN,'WeContacted',NaN,NaN,NaN,NaN,NaN,'WeContacted']}
test_df = pd.DataFrame(data)
The goal is to get the first value of each column for each ID, collapsing the several rows per participant into a single row that holds all of their information. The final dataframe should look like the image given below.
CODE TRIED
I tried using pd.Grouper(), and the code is given below.
test_df['Date'] = pd.to_datetime(test_df['Date'])
test_df1 = (test_df.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.agg("first")
.reset_index())
baseline_df = test_df1[~test_df1.duplicated(subset = ['ID'], keep='first')]
The problem with this is that, with freq='D', the Race_x values that are entered on the next day are missed; the output looks like the image shown below.
If I use freq='M' or freq='Y', the other values are captured; however, the Date column values are changed and we get the end-of-month date for each ID, as shown below.
The final 'Date' column should be the first 'Date' entry for each ID and should not change.
Any help is greatly appreciated. Thank you!
CodePudding user response:
Looks like you want to groupby ID only, aggregating Date as the first value and everything else as the first valid (non-NaN) value.
Assuming each of Race_x, Vaccine, Study_activity and Who_Contacted holds at most a single non-NaN value per ID, you can bfill within each group before aggregating; this pulls the non-NaN values up to the first entry for each participant.
test_df['Date'] = pd.to_datetime(test_df['Date']).dt.date
test_df.update(test_df.groupby('ID').bfill())
Then, try aggregation.
test_df.groupby('ID').first().reset_index()
>>> ID Date Race_x Vaccine Study_activity Who_Contacted
0 AAQRB 2021-10-18 NaN TRUE continue WeContacted
1 AHXSJ 2021-10-22 1.0 TRUE continue WeContacted
2 GABOY 2021-11-01 1.0 TRUE continue NaN
3 GHZGS 2021-10-29 1.0 FALSE continue WeContacted
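As a side note, GroupBy.first() already skips NaN values and returns the first valid entry of each column within each group, so a minimal variant (assuming the same test_df from the question) should give the same result without the bfill step:
test_df['Date'] = pd.to_datetime(test_df['Date']).dt.date  # convert once, keep only the calendar date
test_df.groupby('ID').first().reset_index()                # first non-NaN value of each column per ID
This produces the same single row per ID shown above.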
CodePudding user response:
Create a virtual column to group by month:
>>> test_df.assign(month=test_df['Date'].dt.strftime('%Y-%m')) \
.groupby(['ID', 'month']).agg('first') \
.droplevel(1).reset_index() \
.assign(Date=lambda x: x['Date'].dt.date)
ID Date Race_x Vaccine Study_activity Who_Contacted
0 AAQRB 2021-10-18 NaN TRUE continue WeContacted
1 AHXSJ 2021-10-22 1.0 TRUE continue WeContacted
2 GABOY 2021-11-01 1.0 TRUE continue None
3 GHZGS 2021-10-29 1.0 FALSE continue WeContacted
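Note that this snippet assumes test_df['Date'] has already been converted to datetime (as in the question's own attempt); otherwise the .dt accessor will raise. If starting from the raw strings, convert first:
test_df['Date'] = pd.to_datetime(test_df['Date'])  # required before using .dt.strftime / .dt.date
Grouping on the formatted month string, rather than on pd.Grouper(freq='M'), keeps the original first Date per ID instead of replacing it with the end-of-month date.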