Dataframe sort and remove on date-CodePudding

I have the following data frame

import pandas as pd
from pandas import Timestamp

df=pd.DataFrame({
'Tech en Innovation Fonds': {0: '63.57', 1: '63.57', 2: '63.57', 3: '63.57', 4: '61.03', 5: '61.03', 6: 61.03}, 'Aandelen Index Fonds': {0: '80.22', 1: '80.22', 2: '80.22', 3: '80.22', 4: '79.85', 5: '79.85', 6: 79.85}, 
'Behoudend Mix Fonds': {0: '44.80', 1: '44.8', 2: '44.8', 3: '44.8', 4: '44.8', 5: '44.8', 6: 44.8}, 
'Neutraal Mix Fonds': {0: '50.43', 1: '50.43', 2: '50.43', 3: '50.43', 4: '50.37', 5: '50.37', 6: 50.37}, 
'Dynamisch Mix Fonds': {0: '70.20', 1: '70.2', 2: '70.2', 3: '70.2', 4: '70.04', 5: '70.04', 6: 70.04}, 
'Risicomijdende Strategie': {0: '46.03', 1: '46.03', 2: '46.03', 3: '46.03', 4: '46.08', 5: '46.08', 6: 46.08}, 
'Tactische Strategie': {0: '48.69', 1: '48.69', 2: '48.69', 3: '48.69', 4: '48.62', 5: '48.62', 6: 48.62}, 
'Aandelen Groei Strategie': {0: '52.91', 1: '52.91', 2: '52.91', 3: '52.91', 4: '52.77', 5: '52.77', 6: 52.77}, 
'Datum': {0: Timestamp('2022-07-08 18:00:00'), 1: Timestamp('2022-07-11 19:42:55'), 2: Timestamp('2022-07-12 09:12:09'), 3: Timestamp('2022-07-12 09:29:53'), 4: Timestamp('2022-07-12 15:24:46'), 5: Timestamp('2022-07-12 15:30:02'), 6: Timestamp('2022-07-12 15:59:31')}})

I scrape these from a website several times a day I am looking for a way to clean the dataframe, so that for each day only the latest entry is kept. So for this dataframe 2022-07-12 has 5 entries for 2027-07-12 but I want to keep the last one i.e. 2022-07-12 15:59:31 The entries on the previous day are made already okay manually :-( I intent to do this once a month so each day has several entries

I already tried

dfclean=df.sort_values('Datum').drop_duplicates('Datum', keep='last')

But that gives me al the records back because the time is different

Any one an idea how to do this?

CodePudding user response：

If the data is sorted by date, use a groupby.last:

df.groupby(df['Datum'].dt.date, as_index=False).last()

else:

df.loc[df.groupby(df['Datum'].dt.date)['Datum'].idxmax()]

output:

  Tech en Innovation Fonds Aandelen Index Fonds Behoudend Mix Fonds  \
0                    63.57                80.22               44.80   
1                    63.57                80.22                44.8   
2                    61.03                79.85                44.8   

  Neutraal Mix Fonds Dynamisch Mix Fonds Risicomijdende Strategie  \
0              50.43               70.20                    46.03   
1              50.43                70.2                    46.03   
2              50.37               70.04                    46.08   

  Tactische Strategie Aandelen Groei Strategie               Datum  
0               48.69                    52.91 2022-07-08 18:00:00  
1               48.69                    52.91 2022-07-11 19:42:55  
2               48.62                    52.77 2022-07-12 15:59:31

CodePudding user response：

Below a working example, where I keep only the date part of the timestamp to filter the dataframe:

df['Datum_Date'] = df['Datum'].dt.date
dfclean = df.sort_values('Datum_Date').drop_duplicates('Datum_Date', keep='last')
dfclean = dfclean.drop(columns='Datum_Date', axis=1)

CodePudding user response：

You can use .max() with datetime columns like this:

dfclean = df.loc[
    (df['Datum'].dt.date < df['Datum'].max().date()) | 
    (df['Datum'] == df['Datum'].max())
]

Output:

  Tech en Innovation Fonds Aandelen Index Fonds Behoudend Mix Fonds  \
0                    63.57                80.22               44.80   
1                    63.57                80.22                44.8   
6                    61.03                79.85                44.8   

  Neutraal Mix Fonds Dynamisch Mix Fonds Risicomijdende Strategie  \
0              50.43               70.20                    46.03   
1              50.43                70.2                    46.03   
6              50.37               70.04                    46.08   

  Tactische Strategie Aandelen Groei Strategie               Datum  
0               48.69                    52.91 2022-07-08 18:00:00  
1               48.69                    52.91 2022-07-11 19:42:55  
6               48.62                    52.77 2022-07-12 15:59:31

CodePudding user response：

Does this get you what you need?

df['Day'] = df['Datum'].dt.day
df.loc[df.groupby('Day')['Day'].idxmax()]