I have to run python script daily to extract only latest records of particular day & that latest records should be appended to one csv file.
here am having 4 columns startdate, enddate, virtual_id, value
data = 'startdate': {0:'2022-12-19 06:01:12 00:00',
1: '2022-12-19 13:49:09 00:00',
2: '2022-12-19 13:20:36 00:00',
3: '2022-12-08 13:55:09 00:00',
4: '2022-12-08 13:32:36 00:00',
5: '2023-01-09 15:14:07 00:00',
6: '2023-01-09 15:16:35 00:00'},
'enddate': {0:'2022-12-19 06:25:06 00:00',
1: '2022-12-19 13:59:29 00:00',
2: '2022-12-19 13:26:40 00:00',
3: '2022-12-08 13:55:27 00:00',
4: '2022-12-08 13:45:41 00:00',
5: '2023-12-09 15:29:39 00:00',
6: '2023-01-09 15:28:43 00:00'},
'virtual_id': {0: 1122, 1: 1125, 2: 1144,3: 1100, 4: 1136, 5: 1181, 6:1188},
'value': {0:2000, 1: 5000, 2: 70000, 3: 8000, 4: 90000,5: 7200, 6: 81000}
}
df = pd.DataFrame(data)
Expected output for the 09th Jan if i run script on 09th Jan expected output should show records for the 8th Jan not previous day 07,06 etc to that similarly
data = 'startdate': {0: '2022-12-08 13:55:09 00:00',
1: '2022-12-08 13:32:36 00:00',},
'enddate': { 0: '2022-12-08 13:55:27 00:00',
1: '2022-12-08 13:45:41 00:00', },
'virtual_id': {0:1144, 1:1100},
'value': {0: 70000, 1: 8000}
}
if i run script on 10th Jan expected output should show records for the 9th Jan itself not 08th, 07th, 06th etc
data = 'startdate': {0: '2023-01-09 15:14:07 00:00',
1: '2023-01-09 15:16:35 00:00'},
'enddate': { 0: '2023-12-09 15:29:39 00:00',
1: '2023-01-09 15:28:43 00:00' },
'virtual_id': {0:1181, 1:1188},
'value': {0: 7200, 1: 81000}
}
i have tried below snippets but not working
data = data[data.startdate == data.startdate.max()] & data[data.enddate == data.enddate.max()]
data = data.loc[[data.enddate.idxmax()]]
CodePudding user response:
You can apply a costume function.
from datetime import datetime
import pandas as pd
def extract_latest_records(df):
df['startdate'] = pd.to_datetime(df['startdate'])
df['date'] = df['startdate'].dt.date
current_date = datetime.now().date()
df = df[df['date'] == current_date]
df = df.drop('date', axis=1)
df = df.reset_index(drop=True)
return df
df = pd.DataFrame(data)
latest_df = extract_latest_records(df)
print(latest_df)
Output:
startdate enddate virtual_id value
0 2023-01-09 15:14:07 00:00 2023-12-09 15:29:39 00:00 1122 8000
1 2023-01-09 15:16:35 00:00 2023-01-09 15:28:43 00:00 1188 90000
CodePudding user response:
I'm not sure if this works for you or not, but one issue you may be having is that your startdate
and enddate
columns are currently strings instead of datetime objects. Converting them to datetime
allows you to access different parts, such as the date, day, hour, minute etc.
The second thing is that I think you want to keep rows that are EITHER max start or max end date so that you end up with the last two rows as in your example. This means you need a |
instead of an &
df = pd.DataFrame(data)
df['startdate'] = pd.to_datetime(df['startdate']) #convert from str to datetime
df['enddate'] = pd.to_datetime(df['enddate']) #convert from str to datetime
#filter to rows w/ either the max start or end date
out_df = df[
df.startdate.dt.date.eq(df.startdate.dt.date.max())
| df.enddate.dt.date.eq(df.enddate.dt.date.max())
]
print(out_df)
Output:
startdate enddate virtual_id value
3 2023-01-09 15:14:07 00:00 2023-12-09 15:29:39 00:00 1122 8000
4 2023-01-09 15:16:35 00:00 2023-01-09 15:28:43 00:00 1188 90000