Home > Net >  BeautifulSoup4 extracting extra text not shown in Element Inspector
BeautifulSoup4 extracting extra text not shown in Element Inspector

Time:02-24

I have a script that tries to get a Twitter handle's name. Here's the script:

import requests, re, bs4, lxml
from bs4 import BeautifulSoup
url = 'https://web.archive.org/web/20150623154546/https:/twitter.com/biz/status/600839913286672384'

r = requests.get(url).text
c =  re.compile('.*fullname js-action-profile-name show-popup-with-id.*')
user = bs4.BeautifulSoup(r, "lxml").find("strong", {"class": c}).getText()
print(user)

Now, this is what the Element Inspector looks like: ElementInspector

I want to print only the text highlighted above in yellow (Biz Stone). However, this is my Python output:

Biz StoneVerified account

As you can see, the string "Verified account" is appended, for an unfathomable reason. The text in Inspect Element only shows "Biz Stone". Why does this occur, and how can I fix it?

I tried fixing it by following the instructions in this solution and this one. Unfortunately, the first one (adding text=True) printed random people's names, while adding to it "recursive=False" as in the second solution prints nothing.

CodePudding user response:

It adds Verified account because within the Strong tag is a span tag which likely has text within it somewhere that says Verified account

.getText() will get all the text within that tag

When I've encountered this issue, I've got the text from the span tag first then got the text from the strong tag and then replaced the span text with a blank string

CodePudding user response:

get_text gets all text inside the found tag ("strong" tag contains "span" tags as children, one of which has "Verified account" as text). However, since you don't want to look through the children of "strong" tag, you can do that by setting the parameter recursive=False:

soup = BeautifulSoup(r, "lxml")
out = soup.find("strong", {'class': c}).find(text=True, recursive=False)

Output:

'Biz Stone'
  • Related