Home > Software design >  How to filter out data based on date in python of a csv file
How to filter out data based on date in python of a csv file

Time:05-07

I have a data set as of below & I want to filter data from 2021-07-30 to 2021-08-03 Below is the dataset

input.csv

created_at,text,label
2021-07-24,Newzeland Wins the worldcup,Sport
2021-07-25,ABC Wins the worldcup,Sport
2021-07-26,Hello the worldcup,Sport
2021-07-27,Cricket worldcup,Sport
2021-07-28,Rugby worldcup,Sport
2021-07-29,LLL Wins,Sport
2021-07-30,MMM Wins the worldcup,Sport
2021-07-31,RRR Wins the worldcup,Sport
2021-08-01,OOO Wins the worldcup,Sport
2021-08-02,JJJ Wins the worldcup,Sport
2021-08-03,YYY Wins the worldcup,Sport
2021-08-04,KKK Wins the worldcup,Sport
2021-08-05,YYY Wins the worldcup,Sport
2021-08-06,GGG Wins the worldcup,Sport
2021-08-07,FFF Wins the worldcup,Sport
2021-08-08,SSS Wins the worldcup,Sport
2021-08-09,XYZ Wins the worldcup,Sport
2021-08-10,PQR Wins the worldcup,Sport

output.csv

created_at,text,label
2021-07-30,MMM Wins the worldcup,Sport
2021-07-31,RRR Wins the worldcup,Sport
2021-08-01,OOO Wins the worldcup,Sport
2021-08-02,JJJ Wins the worldcup,Sport
2021-08-03,YYY Wins the worldcup,Sport

import pandas as pd
def save():
    tweets = pd.read_csv(r'input.csv.csv')
    df = pd.DataFrame(tweets, columns=['created_at', 'text','label'])

if __name__ == '__main__':
    save()

CodePudding user response:

df[(df.created_at >= '2021-07-30') & (df.created_at <= '2021-08-03')]

Output:

    created_at                   text  label
6   2021-07-30  MMM Wins the worldcup  Sport
7   2021-07-31  RRR Wins the worldcup  Sport
8   2021-08-01  OOO Wins the worldcup  Sport
9   2021-08-02  JJJ Wins the worldcup  Sport
10  2021-08-03  YYY Wins the worldcup  Sport

CodePudding user response:

Try:

df = pd.read_csv('input.csv', parse_dates=['created_at'])
out = df[df['created_at'].between('2021-07-30', '2021-08-03')]
out.to_csv('output.csv', index=False)

Content of output.csv:

created_at,text,label
2021-07-30,MMM Wins the worldcup,Sport
2021-07-31,RRR Wins the worldcup,Sport
2021-08-01,OOO Wins the worldcup,Sport
2021-08-02,JJJ Wins the worldcup,Sport
2021-08-03,YYY Wins the worldcup,Sport

CodePudding user response:

Another way to achieve this, perhaps even faster:

df = pd.DataFrame({'data':['SPY', 'SPY','SPY', 'SPY', 'SPY', 'SPY','SPY'],
                   'created_at': ['2021-07-30', '2021-07-31', '2021-08-01', '2021-08-02', '2010-05-06', '2021-08-03', '2021-08-04']
                
                  })

df = df.assign(created_at=pd.to_datetime(df['created_at']))

df.query("created_at > '2021-07-30' & created_at < '2021-08-03'")

  • Related