Get specific tag of a XML with BeautifulSoup and Requests-CodePudding

I have this XML page that I'm trying to scrape, but I'm not able to get the content of some of the tags. The ones that drop down are possible, but the others are not.

This is the page I'm trying to scrape: https://g1.globo.com/rss/g1/

I'm trying to get the 'pubDate' tag and when I try find_all it comes back empty, when I try with find, it comes back as None. This is my code. I've have tried many ways, but have failed.

rss_globo = requests.get("https://g1.globo.com/rss/g1/").content
bs_globo = BeautifulSoup(rss_globo, 'lxml')
date = bs_globo.find_all('item')
for i in date:
    date = i.find('pubDate').getText()
    print(date)

CodePudding user response：

In order to work with xml you need a feature not a parser.

Here's how:

import requests
from bs4 import BeautifulSoup

bs_globo = BeautifulSoup(
    requests.get("https://g1.globo.com/rss/g1/").content,
    features="xml",
)
for i in bs_globo.find_all('item'):
    print(i.find('pubDate').getText())

Output:

Mon, 07 Mar 2022 15:05:42 -0000
Mon, 07 Mar 2022 15:04:53 -0000
Mon, 07 Mar 2022 15:04:41 -0000
Mon, 07 Mar 2022 15:03:38 -0000
Mon, 07 Mar 2022 15:03:15 -0000
Mon, 07 Mar 2022 15:01:14 -0000
Mon, 07 Mar 2022 15:00:37 -0000
Mon, 07 Mar 2022 15:00:26 -0000
Mon, 07 Mar 2022 15:00:09 -0000
Mon, 07 Mar 2022 15:00:04 -0000
Mon, 07 Mar 2022 14:59:32 -0000
Mon, 07 Mar 2022 14:58:46 -0000
Mon, 07 Mar 2022 14:58:04 -0000
Mon, 07 Mar 2022 14:58:02 -0000
Mon, 07 Mar 2022 14:55:24 -0000
Mon, 07 Mar 2022 14:51:20 -0000
Mon, 07 Mar 2022 14:50:45 -0000
Mon, 07 Mar 2022 14:50:22 -0000
Mon, 07 Mar 2022 14:50:07 -0000
Mon, 07 Mar 2022 14:49:01 -0000
Mon, 07 Mar 2022 14:47:23 -0000
Mon, 07 Mar 2022 14:47:21 -0000
Mon, 07 Mar 2022 14:46:34 -0000
Mon, 07 Mar 2022 14:46:31 -0000
Mon, 07 Mar 2022 14:45:45 -0000
Mon, 07 Mar 2022 14:45:02 -0000
Mon, 07 Mar 2022 14:44:37 -0000
Mon, 07 Mar 2022 14:44:16 -0000
Mon, 07 Mar 2022 14:43:37 -0000
Mon, 07 Mar 2022 14:42:56 -0000
Mon, 07 Mar 2022 14:42:39 -0000
Mon, 07 Mar 2022 14:42:16 -0000
Mon, 07 Mar 2022 14:41:51 -0000
Mon, 07 Mar 2022 14:41:41 -0000
Mon, 07 Mar 2022 14:41:35 -0000
Mon, 07 Mar 2022 14:41:09 -0000
Mon, 07 Mar 2022 14:40:38 -0000
Mon, 07 Mar 2022 14:39:27 -0000
Mon, 07 Mar 2022 14:39:15 -0000
Mon, 07 Mar 2022 14:39:13 -0000

CodePudding user response：

I want to point that python standard library also has tool for processing XML, namely xml.etree.ElementTree you might use it for this as follows

import xml.etree.ElementTree as ET
import requests
root = ET.fromstring(requests.get("https://g1.globo.com/rss/g1/").text)
for pubDate in root.findall("*/item/pubDate"):
    print(pubDate.text)

output

Mon, 07 Mar 2022 15:28:38 -0000
Mon, 07 Mar 2022 15:27:46 -0000
Mon, 07 Mar 2022 15:27:17 -0000
Mon, 07 Mar 2022 15:26:41 -0000
Mon, 07 Mar 2022 15:24:59 -0000
Mon, 07 Mar 2022 15:24:36 -0000
Mon, 07 Mar 2022 15:24:22 -0000
Mon, 07 Mar 2022 15:23:53 -0000
Mon, 07 Mar 2022 15:23:28 -0000
Mon, 07 Mar 2022 15:22:35 -0000
Mon, 07 Mar 2022 15:22:34 -0000
Mon, 07 Mar 2022 15:20:14 -0000
Mon, 07 Mar 2022 15:17:29 -0000
Mon, 07 Mar 2022 15:16:49 -0000
Mon, 07 Mar 2022 15:16:30 -0000
Mon, 07 Mar 2022 15:15:49 -0000
Mon, 07 Mar 2022 15:15:16 -0000
Mon, 07 Mar 2022 15:12:35 -0000
Mon, 07 Mar 2022 15:12:27 -0000
Mon, 07 Mar 2022 15:10:54 -0000
Mon, 07 Mar 2022 15:10:41 -0000
Mon, 07 Mar 2022 15:10:37 -0000
Mon, 07 Mar 2022 15:09:12 -0000
Mon, 07 Mar 2022 15:08:45 -0000
Mon, 07 Mar 2022 15:07:46 -0000
Mon, 07 Mar 2022 15:05:42 -0000
Mon, 07 Mar 2022 15:04:53 -0000
Mon, 07 Mar 2022 15:04:41 -0000
Mon, 07 Mar 2022 15:03:38 -0000
Mon, 07 Mar 2022 15:03:15 -0000
Mon, 07 Mar 2022 15:01:14 -0000
Mon, 07 Mar 2022 15:00:37 -0000
Mon, 07 Mar 2022 15:00:26 -0000
Mon, 07 Mar 2022 15:00:09 -0000
Mon, 07 Mar 2022 15:00:04 -0000
Mon, 07 Mar 2022 14:59:32 -0000
Mon, 07 Mar 2022 14:58:46 -0000
Mon, 07 Mar 2022 14:58:04 -0000
Mon, 07 Mar 2022 14:58:02 -0000
Mon, 07 Mar 2022 14:55:24 -0000

*/item/pubDate is path describing elements you want to access