Dataframe sort and remove on date

Time:07-13

I have the following data frame

import pandas as pd
from pandas import Timestamp

df=pd.DataFrame({
'Tech en Innovation Fonds': {0: '63.57', 1: '63.57', 2: '63.57', 3: '63.57', 4: '61.03', 5: '61.03', 6: 61.03}, 'Aandelen Index Fonds': {0: '80.22', 1: '80.22', 2: '80.22', 3: '80.22', 4: '79.85', 5: '79.85', 6: 79.85}, 
'Behoudend Mix Fonds': {0: '44.80', 1: '44.8', 2: '44.8', 3: '44.8', 4: '44.8', 5: '44.8', 6: 44.8}, 
'Neutraal Mix Fonds': {0: '50.43', 1: '50.43', 2: '50.43', 3: '50.43', 4: '50.37', 5: '50.37', 6: 50.37}, 
'Dynamisch Mix Fonds': {0: '70.20', 1: '70.2', 2: '70.2', 3: '70.2', 4: '70.04', 5: '70.04', 6: 70.04}, 
'Risicomijdende Strategie': {0: '46.03', 1: '46.03', 2: '46.03', 3: '46.03', 4: '46.08', 5: '46.08', 6: 46.08}, 
'Tactische Strategie': {0: '48.69', 1: '48.69', 2: '48.69', 3: '48.69', 4: '48.62', 5: '48.62', 6: 48.62}, 
'Aandelen Groei Strategie': {0: '52.91', 1: '52.91', 2: '52.91', 3: '52.91', 4: '52.77', 5: '52.77', 6: 52.77}, 
'Datum': {0: Timestamp('2022-07-08 18:00:00'), 1: Timestamp('2022-07-11 19:42:55'), 2: Timestamp('2022-07-12 09:12:09'), 3: Timestamp('2022-07-12 09:29:53'), 4: Timestamp('2022-07-12 15:24:46'), 5: Timestamp('2022-07-12 15:30:02'), 6: Timestamp('2022-07-12 15:59:31')}})

I scrape these values from a website several times a day, and I am looking for a way to clean the dataframe so that only the latest entry is kept for each day. In this dataframe, 2022-07-12 has 5 entries, but I want to keep only the last one, i.e. 2022-07-12 15:59:31. The entries for the previous days have already been cleaned up manually :-( I intend to do this once a month, so each day will have several entries.

I already tried

dfclean=df.sort_values('Datum').drop_duplicates('Datum', keep='last')

But that gives me all the records back, because the times differ.

Does anyone have an idea how to do this?

CodePudding user response:

If the data is sorted by date, use a groupby.last:

df.groupby(df['Datum'].dt.date, as_index=False).last()

Otherwise, pick the row with the latest timestamp for each day:

df.loc[df.groupby(df['Datum'].dt.date)['Datum'].idxmax()]

output:

  Tech en Innovation Fonds Aandelen Index Fonds Behoudend Mix Fonds  \
0                    63.57                80.22               44.80   
1                    63.57                80.22                44.8   
2                    61.03                79.85                44.8   

  Neutraal Mix Fonds Dynamisch Mix Fonds Risicomijdende Strategie  \
0              50.43               70.20                    46.03   
1              50.43                70.2                    46.03   
2              50.37               70.04                    46.08   

  Tactische Strategie Aandelen Groei Strategie               Datum  
0               48.69                    52.91 2022-07-08 18:00:00  
1               48.69                    52.91 2022-07-11 19:42:55  
2               48.62                    52.77 2022-07-12 15:59:31  
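For reference, here is a minimal self-contained sketch of both variants on a cut-down frame. The column `Koers` and its values are made up for illustration; only `Datum` mirrors the question. The rows are deliberately shuffled to show that `idxmax` does not depend on the row order, whereas `last()` only gives the latest entry after sorting.

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample; 'Koers' is an illustrative column
df_demo = pd.DataFrame({
    "Koers": [61.03, 63.57, 61.50, 63.57],
    "Datum": pd.to_datetime([
        "2022-07-12 15:59:31",
        "2022-07-08 18:00:00",
        "2022-07-12 09:12:09",
        "2022-07-11 19:42:55",
    ]),
})

# Order-independent: pick the row holding the latest timestamp of each day
latest = df_demo.loc[df_demo.groupby(df_demo["Datum"].dt.date)["Datum"].idxmax()]

# Order-dependent: sort by timestamp first, then take the last row of each day
ordered = df_demo.sort_values("Datum")
latest_sorted = ordered.groupby(ordered["Datum"].dt.date).last().reset_index(drop=True)
```

One thing to keep in mind: `last()` takes the last non-null value per column, so if the scraped rows can contain missing values, the `idxmax` variant is probably the safer choice because it always returns whole rows.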

CodePudding user response:

Below is a working example, where I keep only the date part of the timestamp to filter the dataframe. Sorting on the full timestamp makes sure that keep='last' really keeps the latest entry of each day:

df['Datum_Date'] = df['Datum'].dt.date
dfclean = df.sort_values('Datum').drop_duplicates('Datum_Date', keep='last')
dfclean = dfclean.drop(columns='Datum_Date')
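If you would rather not add and remove a helper column, the same idea can be sketched with a boolean mask. This is only an illustration on made-up data (`Koers` and `df_demo` are hypothetical names), using `dt.normalize()` to reset every timestamp to midnight and `duplicated(keep='last')` to flag all but the last entry of each day.

```python
import pandas as pd

# Hypothetical sample; 'Koers' is an illustrative column
df_demo = pd.DataFrame({
    "Koers": [63.57, 63.57, 61.50, 61.03],
    "Datum": pd.to_datetime([
        "2022-07-08 18:00:00",
        "2022-07-11 19:42:55",
        "2022-07-12 09:12:09",
        "2022-07-12 15:59:31",
    ]),
})

ordered = df_demo.sort_values("Datum")

# True for every row that is followed by a later row on the same calendar day
is_older = ordered["Datum"].dt.normalize().duplicated(keep="last")

# Keep only the latest row per day; no extra column is created
dfclean = ordered[~is_older]
```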

CodePudding user response:

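You can use .max() with datetime columns like this. Note that this only removes the older entries of the most recent day; rows from earlier days are kept as they are: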

dfclean = df.loc[
    (df['Datum'].dt.date < df['Datum'].max().date()) | 
    (df['Datum'] == df['Datum'].max())
]

Output:

  Tech en Innovation Fonds Aandelen Index Fonds Behoudend Mix Fonds  \
0                    63.57                80.22               44.80   
1                    63.57                80.22                44.8   
6                    61.03                79.85                44.8   

  Neutraal Mix Fonds Dynamisch Mix Fonds Risicomijdende Strategie  \
0              50.43               70.20                    46.03   
1              50.43                70.2                    46.03   
6              50.37               70.04                    46.08   

  Tactische Strategie Aandelen Groei Strategie               Datum  
0               48.69                    52.91 2022-07-08 18:00:00  
1               48.69                    52.91 2022-07-11 19:42:55  
6               48.62                    52.77 2022-07-12 15:59:31  
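To illustrate the scope of this filter, here is a small sketch on made-up data (`Koers` and `df_demo` are hypothetical) where an earlier day also has more than one entry. The `.max()` filter leaves that earlier day untouched, while a per-day `idxmax` reduces every day to its latest row.

```python
import pandas as pd

# Hypothetical sample with two entries on each of two days
df_demo = pd.DataFrame({
    "Koers": [63.57, 63.60, 61.50, 61.03],
    "Datum": pd.to_datetime([
        "2022-07-11 09:00:00",
        "2022-07-11 19:42:55",
        "2022-07-12 09:12:09",
        "2022-07-12 15:59:31",
    ]),
})

newest = df_demo["Datum"].max()

# Filter from the answer: only the most recent day is reduced to one row
only_last_day = df_demo.loc[
    (df_demo["Datum"].dt.date < newest.date()) | (df_demo["Datum"] == newest)
]

# Per-day variant: every day is reduced to its latest row
every_day = df_demo.loc[df_demo.groupby(df_demo["Datum"].dt.date)["Datum"].idxmax()]

print(len(only_last_day), len(every_day))  # 3 2
```

So if the cleanup runs once a month on data where each day has several entries, the per-day variant is likely the one you want.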

CodePudding user response:

Does this get you what you need?

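# Group on the full calendar date; dt.day is only the day of the month
df['Day'] = df['Datum'].dt.date
# idxmax on 'Datum' picks the latest timestamp within each day
df.loc[df.groupby('Day')['Datum'].idxmax()]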
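One caveat about the grouping key: `dt.day` returns only the day of the month, so entries from the 12th of June and the 12th of July would land in the same group, whereas `dt.date` keeps them apart. A tiny sketch on made-up data (`Koers` and `df_demo` are hypothetical):

```python
import pandas as pd

# Hypothetical sample: same day of the month, different months
df_demo = pd.DataFrame({
    "Koers": [60.00, 61.03],
    "Datum": pd.to_datetime(["2022-06-12 17:00:00", "2022-07-12 15:59:31"]),
})

# Number of distinct groups each key would produce
by_day_of_month = df_demo["Datum"].dt.day.nunique()    # both rows share day 12
by_calendar_date = df_demo["Datum"].dt.date.nunique()  # two separate dates

print(by_day_of_month, by_calendar_date)  # 1 2
```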