I have the following data frame
import pandas as pd
from pandas import Timestamp
df=pd.DataFrame({
'Tech en Innovation Fonds': {0: '63.57', 1: '63.57', 2: '63.57', 3: '63.57', 4: '61.03', 5: '61.03', 6: 61.03}, 'Aandelen Index Fonds': {0: '80.22', 1: '80.22', 2: '80.22', 3: '80.22', 4: '79.85', 5: '79.85', 6: 79.85},
'Behoudend Mix Fonds': {0: '44.80', 1: '44.8', 2: '44.8', 3: '44.8', 4: '44.8', 5: '44.8', 6: 44.8},
'Neutraal Mix Fonds': {0: '50.43', 1: '50.43', 2: '50.43', 3: '50.43', 4: '50.37', 5: '50.37', 6: 50.37},
'Dynamisch Mix Fonds': {0: '70.20', 1: '70.2', 2: '70.2', 3: '70.2', 4: '70.04', 5: '70.04', 6: 70.04},
'Risicomijdende Strategie': {0: '46.03', 1: '46.03', 2: '46.03', 3: '46.03', 4: '46.08', 5: '46.08', 6: 46.08},
'Tactische Strategie': {0: '48.69', 1: '48.69', 2: '48.69', 3: '48.69', 4: '48.62', 5: '48.62', 6: 48.62},
'Aandelen Groei Strategie': {0: '52.91', 1: '52.91', 2: '52.91', 3: '52.91', 4: '52.77', 5: '52.77', 6: 52.77},
'Datum': {0: Timestamp('2022-07-08 18:00:00'), 1: Timestamp('2022-07-11 19:42:55'), 2: Timestamp('2022-07-12 09:12:09'), 3: Timestamp('2022-07-12 09:29:53'), 4: Timestamp('2022-07-12 15:24:46'), 5: Timestamp('2022-07-12 15:30:02'), 6: Timestamp('2022-07-12 15:59:31')}})
I scrape these from a website several times a day I am looking for a way to clean the dataframe, so that for each day only the latest entry is kept. So for this dataframe 2022-07-12 has 5 entries for 2027-07-12 but I want to keep the last one i.e. 2022-07-12 15:59:31 The entries on the previous day are made already okay manually :-( I intent to do this once a month so each day has several entries
I already tried
dfclean=df.sort_values('Datum').drop_duplicates('Datum', keep='last')
But that gives me al the records back because the time is different
Any one an idea how to do this?
CodePudding user response:
If the data is sorted by date, use a groupby.last
:
df.groupby(df['Datum'].dt.date, as_index=False).last()
else:
df.loc[df.groupby(df['Datum'].dt.date)['Datum'].idxmax()]
output:
Tech en Innovation Fonds Aandelen Index Fonds Behoudend Mix Fonds \
0 63.57 80.22 44.80
1 63.57 80.22 44.8
2 61.03 79.85 44.8
Neutraal Mix Fonds Dynamisch Mix Fonds Risicomijdende Strategie \
0 50.43 70.20 46.03
1 50.43 70.2 46.03
2 50.37 70.04 46.08
Tactische Strategie Aandelen Groei Strategie Datum
0 48.69 52.91 2022-07-08 18:00:00
1 48.69 52.91 2022-07-11 19:42:55
2 48.62 52.77 2022-07-12 15:59:31
CodePudding user response:
Below a working example, where I keep only the date part of the timestamp to filter the dataframe:
df['Datum_Date'] = df['Datum'].dt.date
dfclean = df.sort_values('Datum_Date').drop_duplicates('Datum_Date', keep='last')
dfclean = dfclean.drop(columns='Datum_Date', axis=1)
CodePudding user response:
You can use .max()
with datetime columns like this:
dfclean = df.loc[
(df['Datum'].dt.date < df['Datum'].max().date()) |
(df['Datum'] == df['Datum'].max())
]
Output:
Tech en Innovation Fonds Aandelen Index Fonds Behoudend Mix Fonds \
0 63.57 80.22 44.80
1 63.57 80.22 44.8
6 61.03 79.85 44.8
Neutraal Mix Fonds Dynamisch Mix Fonds Risicomijdende Strategie \
0 50.43 70.20 46.03
1 50.43 70.2 46.03
6 50.37 70.04 46.08
Tactische Strategie Aandelen Groei Strategie Datum
0 48.69 52.91 2022-07-08 18:00:00
1 48.69 52.91 2022-07-11 19:42:55
6 48.62 52.77 2022-07-12 15:59:31
CodePudding user response:
Does this get you what you need?
df['Day'] = df['Datum'].dt.day
df.loc[df.groupby('Day')['Day'].idxmax()]