Pandas dataframe list iteration with datetime

I have a job that runs every 15 minutes and returns a list of IPs like this:

listofIPs = ['1.1.1.1', '2.2.2.2', '3.3.3.3']

I keep these in a masterlist.csv file, each with the timestamp of the run that reported it, like this:

IP,Timestamp
1.1.1.1,2022-05-12_2030
2.2.2.2,2022-05-12_2030
3.3.3.3,2022-05-12_2030
5.5.5.5,2022-05-12_1430
8.8.8.8,2022-05-11_1930
1.1.1.1,2022-05-06_2030

Every 15 minutes the job returns a list of IPs. For each IP in that list, we need to check whether it is absent from the master CSV for the last 24 hours, and only then append it to the master CSV file.

I want to achieve this using pandas DataFrames or basic Python, not PySpark etc.

I have tried something like this, but it didn't work:

    mainfile = '/Users/Documents/masterlist.csv'
    runtime = datetime.strftime(now, '%Y-%m-%d_%H%M')
    listofIPs = ['1.1.1.1', '2.2.2.2', '3.3.3.3']

    dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d_%H%M')
    df = pd.read_csv(mainfile, parse_dates=['Timestamp'], date_parser=dateparse)
    recent_df = df[df.timestamp > datetime.now() - pd.to_timedelta("1day")]
    badip_df = recent_df.drop_duplicates("IP", keep='last')

    if badip_df.empty:
        badip_df['Timestamp'] = runtime
        badip_df = badip_df[[listofIPs, 'Timestamp']]
        badip_df.to_csv(mainfile, index=False)
    else:
        badip_df.to_csv(mainfile, mode='a', index=False, header=False)

Can someone please help?

CodePudding user response：

If you imported with import datetime, then datetime.strftime, datetime.strptime, and datetime.now() don't exist on the module. All of those are methods of the datetime.datetime class, e.g. datetime.datetime.now() [EDIT: oh, unless you imported using from datetime import datetime, in which case they work as written].
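To make the module-versus-class distinction concrete, here is a small sketch. The alias dt_module and the name FMT are only for illustration; the format string is the one used in the question.

```python
import datetime as dt_module      # the module (aliased here only for illustration)
from datetime import datetime     # the class that lives inside the module

# The module has no now(); the class does.
module_has_now = hasattr(dt_module, "now")
now = datetime.now()

# Round trip through the question's timestamp format.
FMT = "%Y-%m-%d_%H%M"
stamp = now.strftime(FMT)                 # datetime -> string
parsed = datetime.strptime(stamp, FMT)    # string -> datetime (seconds are dropped)
```

With from datetime import datetime, calls such as datetime.now() and datetime.strptime(...) resolve to the class, which is why the question's code only fails on that point if the plain module import was used.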

Also, as mentioned in the comments, goodfp_df is not defined before it is first used. Fix those problems, and you'll be somewhere. Is it supposed to be defined as recent_df[~recent_df.duplicated(keep=False)], by any chance?

Also, df.timestamp should be df.Timestamp, and now isn't defined (should be datetime.now()).

Finally, listofIPs is a list and thus can't be a column title of a dataframe.
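Putting these corrections together, one possible way to do the check described in the question is sketched below. This is not the original code repaired line by line: the helper name new_ips, the fixed now value, and the sample incoming list are assumptions made for illustration. It builds the new rows from the list directly instead of using the list as a column label.

```python
from datetime import datetime, timedelta

import pandas as pd

FMT = "%Y-%m-%d_%H%M"  # timestamp layout used in masterlist.csv


def new_ips(master, ips, now):
    """Return rows for the IPs in `ips` not logged in the 24 hours before `now`."""
    stamps = pd.to_datetime(master["Timestamp"], format=FMT)
    cutoff = now - timedelta(hours=24)
    seen_recently = set(master.loc[stamps > cutoff, "IP"])
    # dict.fromkeys removes duplicates from the incoming list while keeping order
    fresh = [ip for ip in dict.fromkeys(ips) if ip not in seen_recently]
    return pd.DataFrame({"IP": fresh, "Timestamp": [now.strftime(FMT)] * len(fresh)})


# Sample rows from the question; the incoming list and `now` are made up.
master = pd.DataFrame({
    "IP": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "5.5.5.5", "8.8.8.8", "1.1.1.1"],
    "Timestamp": ["2022-05-12_2030", "2022-05-12_2030", "2022-05-12_2030",
                  "2022-05-12_1430", "2022-05-11_1930", "2022-05-06_2030"],
})
now = datetime(2022, 5, 12, 20, 45)
to_append = new_ips(master, ["1.1.1.1", "8.8.8.8", "9.9.9.9"], now)
```

Here 1.1.1.1 is skipped because it was logged 15 minutes earlier, while 8.8.8.8 (last seen more than 24 hours before) and 9.9.9.9 (never seen) are returned. The result could then be appended with to_append.to_csv(mainfile, mode='a', index=False, header=False), as in the question's else branch.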

CodePudding user response：

Is this what you're looking for? I'm using the masterlist.csv described above as a starting point.

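import pandas as pd
from datetime import datetime

mainfile = 'masterlist.csv'
# Timestamp for this run, in the same format the CSV uses
runtime = datetime.strftime(datetime.now(), '%Y-%m-%d_%H%M')
listofIPs = ['1.1.1.1', '2.2.2.2', '3.3.3.3']

# Parse the Timestamp column with the CSV's custom format
# (date_parser is deprecated as of pandas 2.0 in favor of date_format)
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d_%H%M')
df = pd.read_csv(mainfile, parse_dates=['Timestamp'], date_parser=dateparse)
# Keep only the rows newer than the cutoff
recent_df = df[df.Timestamp > datetime.now() - pd.to_timedelta("30day")]  # 30day to get all
# Keep one row per IP (the last occurrence)
badip_df = recent_df.drop_duplicates("IP", keep='last')

if badip_df.empty:
    # No recent rows: write the new IPs, stamped with this run's time
    badip_df = pd.DataFrame(listofIPs, columns=['IP'])
    badip_df['Timestamp'] = runtime
    badip_df.to_csv(mainfile, index=False)
else:
    # Recent rows exist: append them to the file without a header
    badip_df.to_csv(mainfile, mode='a', index=False, header=False)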

When pd.to_timedelta() is set to "30day", the output is

When pd.to_timedelta() is set to "5day", the output is
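As a rough illustration of how the window size changes which rows count as recent, the sketch below applies three windows to the question's sample rows. The fixed reference time (used in place of datetime.now()) and the helper name recent_rows are assumptions chosen so the result is repeatable; this is not the output of the code above.

```python
import io

import pandas as pd

# The sample masterlist.csv from the question, inlined for a self-contained run.
CSV = """IP,Timestamp
1.1.1.1,2022-05-12_2030
2.2.2.2,2022-05-12_2030
3.3.3.3,2022-05-12_2030
5.5.5.5,2022-05-12_1430
8.8.8.8,2022-05-11_1930
1.1.1.1,2022-05-06_2030
"""

df = pd.read_csv(io.StringIO(CSV))
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%Y-%m-%d_%H%M")

now = pd.Timestamp("2022-05-13 12:00")  # assumed reference time


def recent_rows(days):
    """Rows whose Timestamp falls within the last `days` days before `now`."""
    return df[df["Timestamp"] > now - pd.Timedelta(days=days)]


counts = {days: len(recent_rows(days)) for days in (30, 5, 1)}
```

With this reference time, the 30-day window keeps all six rows, the 5-day window drops the 2022-05-06 entry, and the 1-day window keeps only the four rows from 2022-05-12.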