I'm trying to scrape an XML page, but I can't get the content of some of the tags. The tags that expand (drop down) when the page is viewed work, but the others do not.
This is the page I'm trying to scrape: https://g1.globo.com/rss/g1/
I'm trying to get the 'pubDate' tag. With find_all the result comes back empty, and with find it comes back as None. This is my code. I have tried many approaches, but none have worked.
import requests
from bs4 import BeautifulSoup

rss_globo = requests.get("https://g1.globo.com/rss/g1/").content
bs_globo = BeautifulSoup(rss_globo, 'lxml')
date = bs_globo.find_all('item')
for i in date:
    date = i.find('pubDate').getText()
    print(date)
CodePudding user response:
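To parse XML, pass "xml" as the features argument instead of "lxml". With "lxml", BeautifulSoup uses lxml's HTML parser, which lowercases tag names, so 'pubDate' is stored as 'pubdate' and a case-sensitive search for 'pubDate' finds nothing.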
Here's how:
import requests
from bs4 import BeautifulSoup

bs_globo = BeautifulSoup(
    requests.get("https://g1.globo.com/rss/g1/").content,
    features="xml",
)
for i in bs_globo.find_all('item'):
    print(i.find('pubDate').getText())
Output:
Mon, 07 Mar 2022 15:05:42 -0000
Mon, 07 Mar 2022 15:04:53 -0000
Mon, 07 Mar 2022 15:04:41 -0000
Mon, 07 Mar 2022 15:03:38 -0000
Mon, 07 Mar 2022 15:03:15 -0000
Mon, 07 Mar 2022 15:01:14 -0000
Mon, 07 Mar 2022 15:00:37 -0000
Mon, 07 Mar 2022 15:00:26 -0000
Mon, 07 Mar 2022 15:00:09 -0000
Mon, 07 Mar 2022 15:00:04 -0000
Mon, 07 Mar 2022 14:59:32 -0000
Mon, 07 Mar 2022 14:58:46 -0000
Mon, 07 Mar 2022 14:58:04 -0000
Mon, 07 Mar 2022 14:58:02 -0000
Mon, 07 Mar 2022 14:55:24 -0000
Mon, 07 Mar 2022 14:51:20 -0000
Mon, 07 Mar 2022 14:50:45 -0000
Mon, 07 Mar 2022 14:50:22 -0000
Mon, 07 Mar 2022 14:50:07 -0000
Mon, 07 Mar 2022 14:49:01 -0000
Mon, 07 Mar 2022 14:47:23 -0000
Mon, 07 Mar 2022 14:47:21 -0000
Mon, 07 Mar 2022 14:46:34 -0000
Mon, 07 Mar 2022 14:46:31 -0000
Mon, 07 Mar 2022 14:45:45 -0000
Mon, 07 Mar 2022 14:45:02 -0000
Mon, 07 Mar 2022 14:44:37 -0000
Mon, 07 Mar 2022 14:44:16 -0000
Mon, 07 Mar 2022 14:43:37 -0000
Mon, 07 Mar 2022 14:42:56 -0000
Mon, 07 Mar 2022 14:42:39 -0000
Mon, 07 Mar 2022 14:42:16 -0000
Mon, 07 Mar 2022 14:41:51 -0000
Mon, 07 Mar 2022 14:41:41 -0000
Mon, 07 Mar 2022 14:41:35 -0000
Mon, 07 Mar 2022 14:41:09 -0000
Mon, 07 Mar 2022 14:40:38 -0000
Mon, 07 Mar 2022 14:39:27 -0000
Mon, 07 Mar 2022 14:39:15 -0000
Mon, 07 Mar 2022 14:39:13 -0000
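To see the difference between the two parsers without hitting the network, here is a minimal sketch. The SAMPLE string is a made-up one-item feed standing in for the real response, and it assumes both bs4 and lxml are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item feed standing in for the real RSS response.
SAMPLE = (
    '<rss version="2.0"><channel><item>'
    "<title>Example</title>"
    "<pubDate>Mon, 07 Mar 2022 15:05:42 -0000</pubDate>"
    "</item></channel></rss>"
)

html_soup = BeautifulSoup(SAMPLE, "lxml")         # lxml's HTML parser
xml_soup = BeautifulSoup(SAMPLE, features="xml")  # lxml's XML parser

# The HTML parser lowercases tag names, so the mixed-case lookup misses.
html_hit = html_soup.find("pubDate")
html_lower = html_soup.find("pubdate")

# The XML parser keeps the original case, so the lookup succeeds.
xml_hit = xml_soup.find("pubDate")

print(html_hit)               # None
print(html_lower.get_text())  # Mon, 07 Mar 2022 15:05:42 -0000
print(xml_hit.get_text())     # Mon, 07 Mar 2022 15:05:42 -0000
```

Searching for 'pubdate' with the original 'lxml' setup would also work for this tag, but switching to the XML parser is the safer choice for a feed, since it treats the document as XML throughout.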
CodePudding user response:
I want to point out that the Python standard library also has a tool for processing XML, namely xml.etree.ElementTree.
You could use it for this as follows:
import xml.etree.ElementTree as ET
import requests

root = ET.fromstring(requests.get("https://g1.globo.com/rss/g1/").text)
for pubDate in root.findall("*/item/pubDate"):
    print(pubDate.text)
Output:
Mon, 07 Mar 2022 15:28:38 -0000
Mon, 07 Mar 2022 15:27:46 -0000
Mon, 07 Mar 2022 15:27:17 -0000
Mon, 07 Mar 2022 15:26:41 -0000
Mon, 07 Mar 2022 15:24:59 -0000
Mon, 07 Mar 2022 15:24:36 -0000
Mon, 07 Mar 2022 15:24:22 -0000
Mon, 07 Mar 2022 15:23:53 -0000
Mon, 07 Mar 2022 15:23:28 -0000
Mon, 07 Mar 2022 15:22:35 -0000
Mon, 07 Mar 2022 15:22:34 -0000
Mon, 07 Mar 2022 15:20:14 -0000
Mon, 07 Mar 2022 15:17:29 -0000
Mon, 07 Mar 2022 15:16:49 -0000
Mon, 07 Mar 2022 15:16:30 -0000
Mon, 07 Mar 2022 15:15:49 -0000
Mon, 07 Mar 2022 15:15:16 -0000
Mon, 07 Mar 2022 15:12:35 -0000
Mon, 07 Mar 2022 15:12:27 -0000
Mon, 07 Mar 2022 15:10:54 -0000
Mon, 07 Mar 2022 15:10:41 -0000
Mon, 07 Mar 2022 15:10:37 -0000
Mon, 07 Mar 2022 15:09:12 -0000
Mon, 07 Mar 2022 15:08:45 -0000
Mon, 07 Mar 2022 15:07:46 -0000
Mon, 07 Mar 2022 15:05:42 -0000
Mon, 07 Mar 2022 15:04:53 -0000
Mon, 07 Mar 2022 15:04:41 -0000
Mon, 07 Mar 2022 15:03:38 -0000
Mon, 07 Mar 2022 15:03:15 -0000
Mon, 07 Mar 2022 15:01:14 -0000
Mon, 07 Mar 2022 15:00:37 -0000
Mon, 07 Mar 2022 15:00:26 -0000
Mon, 07 Mar 2022 15:00:09 -0000
Mon, 07 Mar 2022 15:00:04 -0000
Mon, 07 Mar 2022 14:59:32 -0000
Mon, 07 Mar 2022 14:58:46 -0000
Mon, 07 Mar 2022 14:58:04 -0000
Mon, 07 Mar 2022 14:58:02 -0000
Mon, 07 Mar 2022 14:55:24 -0000
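*/item/pubDate is the path describing the elements you want to access: * matches any direct child of the root element (here, the channel element under rss), item matches its item children, and pubDate matches the pubDate child of each item.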
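As a rough illustration of how that path is matched, here is a small offline sketch. The SAMPLE feed is invented for the example and only mimics the general shape of an RSS 2.0 document.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed; the structure mimics RSS 2.0.
SAMPLE = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>First</title><pubDate>Mon, 07 Mar 2022 15:05:42 -0000</pubDate></item>
    <item><title>Second</title><pubDate>Mon, 07 Mar 2022 15:04:53 -0000</pubDate></item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE)  # root is the <rss> element

# "*" matches any direct child of root (<channel>), then item, then pubDate.
dates = [el.text for el in root.findall("*/item/pubDate")]

# ".//pubDate" searches all descendants of root, whatever their depth.
dates_any_depth = [el.text for el in root.findall(".//pubDate")]

print(dates)
print(dates == dates_any_depth)
```

The explicit path is stricter, since it only matches pubDate elements that sit directly inside an item, while ".//pubDate" would also pick up a pubDate placed elsewhere in the document.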