Datetime Time Zone Scraping Python

Time:06-02

I am trying to scrape and sort articles with a body, headline, and date column. However, when pulling the date, I’m running into an error with the time zone:

ValueError: time data 'Jun 1, 2022 2:49PM EDT' does not match format '%b %d, %Y %H:%M%p %z'

My code is as follows:

def get_info(url):
    headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text)
    news = soup.find('div', attrs={'class': 'body__content'}).text
    headline = soup.find('h1').text
    date = datetime.datetime.strptime(soup.find('time').text, "%b %d, %Y %H:%M%p %z")
    columns = [news, headline, date]
    column_names = ['News','Headline','Date']
    return dict(zip(column_names, columns))

Is there a way to grab the time zone in a similar method or just drop it overall?

CodePudding user response:

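Note that %z in strptime() matches UTC offsets such as -0400, not time zone names, and %Z only accepts a few values: UTC, GMT, and the names of the local machine's time zone. "EDT" therefore fails unless the machine itself is on Eastern time. Also, %p only affects the parsed hour when it is paired with %I (12-hour clock); with %H the AM/PM marker is ignored. For details, see the strptime() format codes in the datetime docs.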
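To stay within the standard library, one workaround is to split off the abbreviation, parse the rest with strptime(), and attach a fixed offset from a small lookup table. The following is a minimal sketch rather than a general solution: TZ_OFFSETS and parse_article_date are hypothetical names, the table only covers the abbreviations you add to it, and the input is assumed to always end with a space followed by an abbreviation.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup table: extend it with whatever abbreviations the site uses.
TZ_OFFSETS = {
    "EDT": timezone(timedelta(hours=-4), "EDT"),
    "EST": timezone(timedelta(hours=-5), "EST"),
}

def parse_article_date(text):
    """Parse strings such as 'Jun 1, 2022 2:49PM EDT'."""
    # Split on the last space: everything before it is the timestamp.
    stamp, _, abbrev = text.strip().rpartition(" ")
    # %I (12-hour clock) is needed for %p to take effect.
    naive = datetime.strptime(stamp, "%b %d, %Y %I:%M%p")
    tz = TZ_OFFSETS.get(abbrev)
    # Unknown abbreviation: return the naive value rather than guess an offset.
    return naive.replace(tzinfo=tz) if tz else naive

print(parse_article_date("Jun 1, 2022 2:49PM EDT"))  # 2022-06-01 14:49:00-04:00
```

If you would rather drop the time zone altogether, returning `naive` unconditionally gives a plain datetime with the wall-clock time shown on the page.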

The simplest option is to use the dateparser module, which can parse dates that contain time zone names (e.g. EDT).

import dateparser

s = "Jun 1, 2022 2:49PM EDT"
d = dateparser.parse(s)
print(d)

Output:

2022-06-01 14:49:00-04:00

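dateutil can parse this string as well, but its parser does not resolve an abbreviation such as "EDT" on its own unless it happens to match the local machine's time zone; otherwise it returns a naive datetime. To get an aware result, pass a tzinfos mapping from the abbreviation to its UTC offset in seconds (EDT is UTC-04:00, i.e. -14400).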

import dateutil.parser

s = "Jun 1, 2022 2:49PM EDT"
tzinfos = {"EDT": -14400}
d = dateutil.parser.parse(s, tzinfos=tzinfos)
print(d)

Output:

2022-06-01 14:49:00-04:00
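Since the goal is to sort the scraped articles, it may help to see why keeping the offset matters. The sketch below uses made-up records shaped like the dicts returned by get_info(), with fixed offsets standing in for the parsed time zones; the headlines and times are illustrative only.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4), "EDT")
PDT = timezone(timedelta(hours=-7), "PDT")

# Hypothetical records shaped like the dicts returned by get_info().
articles = [
    {"Headline": "A", "Date": datetime(2022, 6, 1, 14, 49, tzinfo=EDT)},
    {"Headline": "B", "Date": datetime(2022, 6, 1, 12, 30, tzinfo=PDT)},
    {"Headline": "C", "Date": datetime(2022, 6, 1, 9, 15, tzinfo=EDT)},
]

# Aware datetimes compare by absolute instant, so mixed zones sort correctly.
ordered = sorted(articles, key=lambda a: a["Date"])
print([a["Headline"] for a in ordered])  # ['C', 'A', 'B']
```

Sorting by wall-clock time alone would put B before A. Note that comparing a naive datetime with an aware one raises a TypeError, so either keep the time zone on every record or drop it on every record.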

CodePudding user response:

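As an alternative to @CodeMonkey's solution, you can also try pandas. tz_localize() attaches the US/Eastern zone to the parsed timestamp, which assumes the parse result is time zone naive:

import pandas as pd

pd.to_datetime('Jun 1, 2022 2:49PM EDT').tz_localize('US/Eastern')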