I'm trying to extract the dates from a soup built by scraping the Google News page:
date_section = soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"})
print(date_section)
output:
[<div style="bottom:0px"><span>30 May 2020</span></div>,
<div style="bottom:0px"><span>3 weeks ago</span></div>,
<div style="bottom:0px"><span>1 week ago</span></div>,
<div style="bottom:0px"><span>2 weeks ago</span></div>,
<div style="bottom:0px"><span>22 Nov 2020</span></div>,
<div style="bottom:0px"><span>19 Mar 2019</span></div>,
<div style="bottom:0px"><span>18 Mar 2019</span></div>,
<div style="bottom:0px"><span>11 Aug 2019</span></div>,
<div style="bottom:0px"><span>1 Aug 2019</span></div>,
<div style="bottom:0px"><span>4 Jun 2009</span></div>]
I want to get a list of all the dates from this structure. I can build one by looping, but I was wondering if there is a more elegant way of working with BeautifulSoup to access the dates in such a structure. This is how I'm currently accessing a single date:
print("First date", date_section[0].text)
CodePudding user response:
Try:
from bs4 import BeautifulSoup
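# The class attribute is restored here so that find_all() below has something to match
html = '''
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>30 May 2020</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>3 weeks ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 week ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>2 weeks ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>22 Nov 2020</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>19 Mar 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>18 Mar 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>11 Aug 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 Aug 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>4 Jun 2009</span></div>
'''
soup = BeautifulSoup(html, 'lxml')
date_section = soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"})
# Print the text of each matched div
for d in date_section:
    print(d.text)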
Output:
30 May 2020
3 weeks ago
1 week ago
2 weeks ago
22 Nov 2020
19 Mar 2019
18 Mar 2019
11 Aug 2019
1 Aug 2019
4 Jun 2009
CodePudding user response:
"I want to get a list of all the dates from this structure."
To get a list, simply iterate over your ResultSet, e.g. with a list comprehension:
[e.get_text(strip=True) for e in soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"})]
or with CSS selectors:
[e.get_text(strip=True) for e in soup.select('div.OSrXXb.ZE0LJd.YsWzw span')]
Both lead to:
['30 May 2020', '3 weeks ago', '1 week ago', '2 weeks ago', '22 Nov 2020', '19 Mar 2019', '18 Mar 2019', '11 Aug 2019', '1 Aug 2019', '4 Jun 2009']
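Example:
from bs4 import BeautifulSoup
# The class attribute is restored here so that find_all() below has something to match
html = '''
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>30 May 2020</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>3 weeks ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 week ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>2 weeks ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>22 Nov 2020</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>19 Mar 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>18 Mar 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>11 Aug 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 Aug 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>4 Jun 2009</span></div>
'''
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
print([e.get_text(strip=True) for e in soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"})])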
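One difference between the two approaches may matter on a page whose markup changes: a multi-class string passed to find_all() is compared against the exact value of the class attribute, while a CSS selector checks each class on its own. The sketch below illustrates this with made-up markup in which the second div lists the same classes in a different order; that reordering is an assumption for illustration, not something taken from the real Google News page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second div carries the same three classes in a different order.
html = '''
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>30 May 2020</span></div>
<div class="YsWzw OSrXXb ZE0LJd" style="bottom:0px"><span>3 weeks ago</span></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# A multi-class string is matched against the exact value of the class attribute.
exact = [e.get_text(strip=True)
         for e in soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"})]

# A CSS selector requires all three classes, in any order.
by_selector = [e.get_text(strip=True)
               for e in soup.select('div.OSrXXb.ZE0LJd.YsWzw span')]

print(exact)
print(by_selector)
```

Here only the selector version should pick up both dates, so select() is probably the safer choice if the order of the class names cannot be relied on.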