How to access the text of the <span> tag with BeautifulSoup

Time:11-11

I'm trying to extract the dates from a soup built by scraping the Google News page:

date_section = soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"})
print(date_section)

output:

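[<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>30 May 2020</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>3 weeks ago</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 week ago</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>2 weeks ago</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>22 Nov 2020</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>19 Mar 2019</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>18 Mar 2019</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>11 Aug 2019</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 Aug 2019</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>4 Jun 2009</span></div>]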

I want to get a list of all the dates from this structure.

This is how I currently access a single date; looping over the result gives me the full list. Is there a more elegant way to pull the dates out of such a structure with BeautifulSoup?

print("First date",date_section[0].text)

CodePudding user response:

Try:

from bs4 import BeautifulSoup

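# Sample markup from the question, with the class attribute that find_all() filters on
html = '''
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>30 May 2020</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>3 weeks ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 week ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>2 weeks ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>22 Nov 2020</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>19 Mar 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>18 Mar 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>11 Aug 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 Aug 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>4 Jun 2009</span></div>
'''

soup = BeautifulSoup(html, 'lxml')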

date_section = soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"})
for d in date_section:
    print(d.text)

Output:

30 May 2020
3 weeks ago
1 week ago
2 weeks ago
22 Nov 2020
19 Mar 2019
18 Mar 2019
11 Aug 2019
1 Aug 2019
4 Jun 2009
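
Since the title asks about the <span> itself, here is a minimal sketch that reads the text from the child span instead of the whole div. The two dated rows are copied from the question; the third, empty div is a hypothetical row added only to show the guard, and the stdlib 'html.parser' is used here as an assumption so that lxml is not required.

```python
from bs4 import BeautifulSoup

# First two rows come from the question; the empty third div is hypothetical.
html = '''
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>30 May 2020</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>3 weeks ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"></div>
'''

soup = BeautifulSoup(html, 'html.parser')

dates = []
for div in soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"}):
    span = div.find('span')  # None when the div has no <span> child
    if span is not None:
        dates.append(span.get_text(strip=True))

print(dates)
```

For markup like the question's, where every div holds exactly one span, d.text on the div gives the same strings; the explicit span lookup only matters if a div could carry other text or no span at all.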

CodePudding user response:

I want to get a list of all the dates from this structure.

To get a list, iterate over the ResultSet, for example with a list comprehension:

[e.get_text(strip=True) for e in soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"})]

or with CSS selectors:

[e.get_text(strip=True) for e in soup.select('div.OSrXXb.ZE0LJd.YsWzw span')]

Both produce:

['30 May 2020', '3 weeks ago', '1 week ago', '2 weeks ago', '22 Nov 2020', '19 Mar 2019', '18 Mar 2019', '11 Aug 2019', '1 Aug 2019', '4 Jun 2009']

Example

from bs4 import BeautifulSoup

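# Sample markup from the question, with the class attribute that find_all() filters on
html = '''
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>30 May 2020</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>3 weeks ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 week ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>2 weeks ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>22 Nov 2020</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>19 Mar 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>18 Mar 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>11 Aug 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 Aug 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>4 Jun 2009</span></div>
'''

# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')

dates = [e.get_text(strip=True) for e in soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"})]
print(dates)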
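
The CSS-selector variant can be sketched the same way. One practical difference, as far as I know: a multi-class string passed to find_all() is compared against the attribute value as written, so the class order matters, while select() matches each class independently. The third row below, with the classes reordered, is a hypothetical addition to show that; it is not from the question.

```python
from bs4 import BeautifulSoup

# Rows 1-2 are from the question; row 3 (classes reordered) is hypothetical.
html = '''
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>30 May 2020</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>3 weeks ago</span></div>
<div class="YsWzw OSrXXb ZE0LJd" style="bottom:0px"><span>1 week ago</span></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# CSS selector: every listed class must be present, in any order.
by_selector = [e.get_text(strip=True)
               for e in soup.select('div.OSrXXb.ZE0LJd.YsWzw span')]

# Exact class string: only divs whose class attribute reads exactly like this.
by_exact = [e.get_text(strip=True)
            for e in soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"})]

print(by_selector)
print(by_exact)
```

If the page might emit the classes in a different order, the selector form is probably the safer choice; for the markup shown in the question both return the same list.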