How to remove the span tag and class name after scraping when I only want the text, using Python

for link in soup.findAll('li'):
    if "c-listing__authors-list" in str(link):
        # theAuthor = link.string
        theAuthor = str(link).replace("</p>","")
        theAuthor = theAuthor.split("</span>")[1]
        listAuthor.append(theAuthor)

CodePudding user response：

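Select the elements directly with a CSS selector and call get_text(strip=True) on each one. That returns only the text, without the tags or the class attribute: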

for e in soup.select('li span.c-listing__authors-list'):
    theAuthor = e.get_text(strip=True)

Or build the whole list in one line with a list comprehension:

theAuthor = [e.get_text(strip=True) for e in soup.select('li span.c-listing__authors-list')]

Example

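from bs4 import BeautifulSoup

html = '''
<ul>
<li><span class="c-listing__authors-list">a</span></li>
<li><span class="c-listing__authors-list">b</span></li>
<li><span>no list</span></li>
</ul>
'''
# Name the parser explicitly so bs4 does not warn about guessing one
soup = BeautifulSoup(html, 'html.parser')

# Collect the text of every span that carries the target class
theAuthor = []
for e in soup.select('li span.c-listing__authors-list'):
    theAuthor.append(e.get_text(strip=True))

print(theAuthor)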

Output

['a', 'b']

CodePudding user response：

This answer is Microsoft (.NET) centric, but I hope it points you in the right direction.

It's been a while since I wrote a scraper, but this should be possible if you also know XPath. As I recall, you can read the web page into an HTMLDocument, select the element you need with XPath, and then read its text value.
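
For what it's worth, the same XPath idea can be roughly sketched in Python with the standard library's xml.etree.ElementTree, which supports a limited XPath subset. This is only an illustration under assumptions: the markup below is a hypothetical, well-formed snippet, and ElementTree will reject the sloppy HTML that real pages usually contain, so a dedicated HTML parser is likely needed in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup (an assumption, not the asker's real page).
# It must be well-formed XML for ElementTree to parse it.
markup = """
<ul>
  <li><span class="c-listing__authors-list">a</span></li>
  <li><span class="c-listing__authors-list">b</span></li>
  <li><span>no list</span></li>
</ul>
""".strip()

root = ET.fromstring(markup)

# Limited XPath: spans directly under any li whose class attribute
# equals the target value exactly.
spans = root.findall(".//li/span[@class='c-listing__authors-list']")

# itertext() yields the text inside each element (and its descendants),
# so the tags and attributes are left behind.
authors = ["".join(span.itertext()).strip() for span in spans]
print(authors)
```

Note that the [@class='...'] predicate is an exact string match, so an element with several classes would not be selected by this expression.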