Home > Blockchain >  Subsetting a dataframe greater than a specific date Pandas and append to another dataframe
Subsetting a dataframe greater than a specific date Pandas and append to another dataframe

Time:09-18

I have a dataframe (df1) with a date column with the dd/mm/yyyy date format.

I have second dataframe (df2) with the same structure, however, has some shared data. I want to add the data from df2 to df1 for the data is after the most recent date in df1.

My approach was to find the maxdate in df1 and then look for dates in df2 subset and append to df1

maxdate = df.loc[pd.to_datetime(df['DATE'],dayfirst=True).idxmax(), 'DATE']
# in this instance it is 11/09/2022 (dd/mm/yyyy)

df3 = df2.loc[pd.to_datetime(df2['DATE']) > maxdate] #this is to by my subset to append to df1

some of the df1 below

            DATE      TIME            X            Y             Z
3692  23/08/2022  16:55:00  734154.2872  9551189.353  2.845237e 03
3693  23/08/2022  16:55:00  734199.2516  9551070.666  2.842993e 03
3694  23/08/2022  05:02:00  734669.6130  9551361.865  2.845012e 03
3695  24/08/2022  17:25:00  734215.9910  9551068.295  2.842111e 03
3696  24/08/2022  17:25:00  734684.8444  9551383.618  2.846049e 03
3697  27/08/2022  17:20:00  734214.1851  9551061.242  2.841501e 03
3698  28/08/2022  17:00:00  734669.6130  9551361.865  2.845012e 03
3699  30/08/2022  05:25:00  734176.3412  9551168.550  2.844325e 03
3700  01/09/2022  17:18:00  734686.1061  9551385.420  2.846083e 03
3701  01/09/2022  17:18:00  734667.0922  9551358.264  2.844812e 03
3702  01/09/2022  17:18:00  734164.7047  9551178.039  2.844962e 03
3703  02/09/2022  17:16:00  734151.9079  9551185.951  2.845472e 03
3704  03/09/2022  17:15:00  734141.2542  9551197.062  2.844747e 03
3705  04/09/2022  17:08:00  734687.3678  9551387.222  2.846116e 03
3706  04/09/2022  17:08:00  734665.8319  9551356.464  2.844713e 03
3707  05/09/2022  05:08:00  734704.3326  9551376.581  2.842331e 03
3708  07/09/2022  16:58:00  734687.3678  9551387.222  2.846116e 03
3709  08/09/2022  16:55:00  734663.3109  9551352.864  2.844512e 03
3710  10/09/2022  17:03:00  734689.8913  9551390.826  2.846184e 03
3711  11/09/2022  17:13:00  734691.1530  9551392.628  9.551393e 06

some of df2 below

         DATE      TIME            X             Y             Z
134  23/08/22  16:55:00  734154.2872  9551189.3534      2845.237
135  23/08/22  16:55:00  734199.2516  9551070.6664     2842.9929
136  23/08/22   5:02:00   734669.613  9551361.8645     2845.0122
138  24/08/22  17:25:00   734215.991  9551068.2954     2842.1106
139  24/08/22  17:25:00  734684.8444   9551383.618     2846.0492
147  27/08/22  17:20:00  734214.1851  9551061.2423      2841.501
149  28/08/22  17:00:00   734669.613  9551361.8645     2845.0122
151  29/08/22  17:30:00            -             -             -
153  30/08/22   5:25:00  734176.3412  9551168.5498      2844.325
180  11/09/22  17:13:00   734691.153  9551392.6276  9551392.6276

However df3 is sub setting the dataframe that includes dates before the "maxdate"

I feel it is related to the date format that I have.

any help appreciated.

CodePudding user response:

You need to convert the values to pandas DateTime, else the comparison will be based on string values and not the dates, also its not clear if 11 is the day or 09 is the day in sample max date 11/09/2022, if 11 is the day, you also need to pass dayfirst=True to pd.to_datetime:

>>> maxdate=pd.to_datetime('11/09/2022')
# Timestamp('2022-11-09 00:00:00')

>>> df2 = df.loc[pd.to_datetime(df['DATE'], dayfirst=True) > maxdate]

Here is the execution for the sample data you have added to the question:

# Getting the max date from first dataframe
max_date=pd.to_datetime(df1['DATE'],dayfirst=True).max()
max_date
Timestamp('2022-09-11 00:00:00')

# Filtering second dataframe based on maximum date
df2[pd.to_datetime(df2['DATE'], dayfirst=True)>max_date]
Empty DataFrame
Columns: [DATE, TIME, X, Y, Z]
Index: []
# Result is empty dataframe for the sample data cause no record matches condition

# Records for maximum date:
df2[pd.to_datetime(df2['DATE'], dayfirst=True)==max_date]
         DATE      TIME           X             Y             Z
180  11/09/22  17:13:00  734691.153  9551392.6276  9551392.6276


# Records for dates older than the maximum date:
df2[pd.to_datetime(df2['DATE'], dayfirst=True)<max_date]
         DATE      TIME            X             Y          Z
134  23/08/22  16:55:00  734154.2872  9551189.3534   2845.237
135  23/08/22  16:55:00  734199.2516  9551070.6664  2842.9929
136  23/08/22   5:02:00   734669.613  9551361.8645  2845.0122
138  24/08/22  17:25:00   734215.991  9551068.2954  2842.1106
139  24/08/22  17:25:00  734684.8444   9551383.618  2846.0492
147  27/08/22  17:20:00  734214.1851  9551061.2423   2841.501
149  28/08/22  17:00:00   734669.613  9551361.8645  2845.0122
151  29/08/22  17:30:00            -             -          -
153  30/08/22   5:25:00  734176.3412  9551168.5498   2844.325
  • Related