I made a web scraper to get the informative text of a Wikipedia page. I get the text I want, but I need to cut off a large part of the text at the bottom. I already tried some other solutions, but with those I don't get the headers and whitespace I need.
import requests
from bs4 import BeautifulSoup
import re

website = "https://nl.wikipedia.org/wiki/Kat_(dier)"
request = requests.get(website)
soup = BeautifulSoup(request.text, "html.parser")

text = list()
text.extend(soup.findAll('mw-content-text'))

# all visible text of the page as one string
text_content = soup.text
text_content = re.sub(r'==.*?== ', '', text_content)
# text_content = text_content.replace('\n', '')
print(text_content)
Here, soup.text is all the text of the Wikipedia page with class='mw-content-text', printed as a string. This prints the overall text I need, but I need to cut off the string where it starts showing the text of the sources. I already tried the replace method, but it didn't do anything.
Given this page, I want to cut off what's under the red line in the big string of text I have scraped.
I tried something like this, which didn't work:
for content in soup('span', {'class': 'mw-content-text'}):
    print(content.text)
    text = content.findAll('p', 'a')
    for t in text:
        print(t.text)
I also tried this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import requests
website = urlopen("https://nl.wikipedia.org/wiki/Kat_(dier)").read()
soup = BeautifulSoup(website,'lxml')
text = ''
for content in soup.find_all('p'):
    text = content.text
    text = re.sub(r'\[.*?\] ', '', text)
    text = text.replace('\n', '')
    # print(text)
but these approaches just gave me an unreadable mess of text. I still want the whitespace and headers that my base code gives me.
CodePudding user response:
It is still a bit abstract, but you could reach your goal by iterating over all children and breaking as soon as a tag with the class appendix appears:
for c in soup.select_one('#mw-content-text > div').find_all(recursive=False):
    if c.get('class') and 'appendix' in c.get('class'):
        break
    print(c.get_text(strip=True))
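Here find_all(recursive=False) walks only the direct children of the content div, so headings and paragraphs come out in page order and the loop can stop cleanly at the first block carrying the appendix class, which this answer uses as the cut-off marker for the sources section.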
Example
import requests
from bs4 import BeautifulSoup

website = "https://nl.wikipedia.org/wiki/Kat_(dier)"
request = requests.get(website)
soup = BeautifulSoup(request.text, "html.parser")

for c in soup.select_one('#mw-content-text > div').find_all(recursive=False):
    if c.get('class') and 'appendix' in c.get('class'):
        break
    print(c.get_text(strip=True))
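Since the question asks to keep the headers and whitespace, a possible variant (a sketch, not part of the original answer) collects each child's text without stripping it and joins the pieces with blank lines:
parts = []
for c in soup.select_one('#mw-content-text > div').find_all(recursive=False):
    if c.get('class') and 'appendix' in c.get('class'):
        break
    # get_text() without strip=True keeps the element's own line breaks
    parts.append(c.get_text())
print('\n\n'.join(parts))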
CodePudding user response:
There is likely a more efficient solution, but here is a list comprehension that solves your issue:
# the rest of your code
references = [line for line in text_content.split('\n') if line.startswith("↑")]
Here's an alternative version that might be easier to understand:
# the rest of your code
# Turn text_content into a list of lines
text_content = text_content.split('\n')
references = []
# Iterate through each line and only save the values that start
# with the symbol used for each reference on Wikipedia: "↑"
# (or "^" for English Wikipedia pages)
for line in text_content:
    if line.startswith("↑"):
        references.append(line)
Both scripts will do the same thing.
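If the goal is instead to keep the article text and drop the reference lines, the same startswith test can be inverted; a sketch, reusing the text_content string from the first snippet:
# keep every line that does not start with the reference marker
article_lines = [line for line in text_content.split('\n') if not line.startswith("↑")]
print('\n'.join(article_lines))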