Home > Blockchain >  Extract each day rows automatically based on start date & end date columns from API using python
Extract each day rows automatically based on start date & end date columns from API using python

Time:01-10

I have to run python script daily to extract only latest records of particular day & that latest records should be appended to one csv file.

here am having 4 columns startdate, enddate, virtual_id, value

data = 'startdate': {0:'2022-12-19 06:01:12 00:00', 
                           1: '2022-12-19 13:49:09 00:00', 
                           2: '2022-12-19 13:20:36 00:00',
                           3: '2022-12-08 13:55:09 00:00', 
                           4: '2022-12-08 13:32:36 00:00', 
                           5: '2023-01-09 15:14:07 00:00', 
                           6: '2023-01-09 15:16:35 00:00'}, 
        'enddate': {0:'2022-12-19 06:25:06 00:00', 
                         1: '2022-12-19 13:59:29 00:00', 
                         2: '2022-12-19 13:26:40 00:00', 
                         3: '2022-12-08 13:55:27 00:00', 
                         4: '2022-12-08 13:45:41 00:00', 
                         5: '2023-12-09 15:29:39 00:00', 
                         6: '2023-01-09 15:28:43 00:00'}, 
        'virtual_id': {0: 1122, 1: 1125, 2: 1144,3: 1100, 4: 1136, 5: 1181, 6:1188}, 
        'value': {0:2000, 1: 5000, 2: 70000, 3: 8000, 4: 90000,5: 7200, 6: 81000}
        }

df = pd.DataFrame(data)

Expected output for the 09th Jan if i run script on 09th Jan expected output should show records for the 8th Jan not previous day 07,06 etc to that similarly

data = 'startdate': {0: '2022-12-08 13:55:09 00:00', 
                     1: '2022-12-08 13:32:36 00:00',}, 
        'enddate': { 0: '2022-12-08 13:55:27 00:00', 
                     1: '2022-12-08 13:45:41 00:00', }, 
        'virtual_id': {0:1144, 1:1100}, 
        'value': {0: 70000, 1: 8000}
        }

if i run script on 10th Jan expected output should show records for the 9th Jan itself not 08th, 07th, 06th etc

data = 'startdate': {0: '2023-01-09 15:14:07 00:00', 
                     1: '2023-01-09 15:16:35 00:00'}, 
        'enddate': { 0: '2023-12-09 15:29:39 00:00', 
                     1: '2023-01-09 15:28:43 00:00' }, 
        'virtual_id': {0:1181, 1:1188}, 
        'value': {0: 7200, 1: 81000}
        }

i have tried below snippets but not working

data = data[data.startdate == data.startdate.max()] & data[data.enddate == data.enddate.max()]

data = data.loc[[data.enddate.idxmax()]]

CodePudding user response:

You can apply a costume function.

from datetime import datetime
import pandas as pd

def extract_latest_records(df):
    df['startdate'] = pd.to_datetime(df['startdate'])
    df['date'] = df['startdate'].dt.date
    current_date = datetime.now().date()
    df = df[df['date'] == current_date] 
    df = df.drop('date', axis=1)
    df = df.reset_index(drop=True)
    return df
    
df = pd.DataFrame(data)

latest_df = extract_latest_records(df)
print(latest_df)

Output:

                  startdate                    enddate  virtual_id  value
0 2023-01-09 15:14:07 00:00  2023-12-09 15:29:39 00:00        1122   8000
1 2023-01-09 15:16:35 00:00  2023-01-09 15:28:43 00:00        1188  90000

CodePudding user response:

I'm not sure if this works for you or not, but one issue you may be having is that your startdate and enddate columns are currently strings instead of datetime objects. Converting them to datetime allows you to access different parts, such as the date, day, hour, minute etc.

The second thing is that I think you want to keep rows that are EITHER max start or max end date so that you end up with the last two rows as in your example. This means you need a | instead of an &

df = pd.DataFrame(data)
df['startdate'] = pd.to_datetime(df['startdate']) #convert from str to datetime
df['enddate'] = pd.to_datetime(df['enddate']) #convert from str to datetime

#filter to rows w/ either the max start or end date
out_df = df[
    df.startdate.dt.date.eq(df.startdate.dt.date.max())
    | df.enddate.dt.date.eq(df.enddate.dt.date.max())
]

print(out_df)

Output:

                  startdate                   enddate  virtual_id  value
3 2023-01-09 15:14:07 00:00 2023-12-09 15:29:39 00:00        1122   8000
4 2023-01-09 15:16:35 00:00 2023-01-09 15:28:43 00:00        1188  90000
  • Related