I apologize for my English. I have a Python script in which a for loop should delete the last character of a string whenever the string ends in a special character. It is a kind of web scraper that should turn the text of a web page into a word list. But when I try to delete the last character of a string, it doesn't work and I get this error message:
Traceback (most recent call last):
  File "/home/kali/Scripts/Web Scraping/webscraper.py", line 29, in <module>
    if elem[-1] == x:
IndexError: string index out of range
How can I fix this?
My code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://test-website.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
sonderzeichen = ["?",".","[","]","{","}","|","#","&","*","/",":",";"," ","-","_","=","<",">","..."]
word_list = text.split()
for elem in list(word_list):
    for x in sonderzeichen:
        if elem[-1] == x:
            elem = elem[:-1:]
    print(elem)
print(word_list)
CodePudding user response:
If I understand you correctly, you want to remove all trailing sonderzeichen characters from all words in word_list. If all members of sonderzeichen were single characters, this could be done like this:
word_list = [
    word[:-1] if word[-1] in sonderzeichen else word
    for word in word_list
]
This uses a list comprehension, a conditional expression (inline if-else), and a membership test.
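To see what the list comprehension above does in practice, here is a small self-contained run. The shortened sonderzeichen list and the sample words are made up for illustration; the real word_list would come from the scraped page. Note that only one trailing character is removed per word, and that a word consisting of a single special character becomes an empty string.

```python
# Hypothetical sample data, not taken from the question's web page.
sonderzeichen = ["?", ".", ":", ";", "-"]
word_list = ["Hello.", "world", "why?", "wait..", "-"]

# Drop the last character of each word if it is one of the special characters.
word_list = [
    word[:-1] if word[-1] in sonderzeichen else word
    for word in word_list
]

print(word_list)  # ['Hello', 'world', 'why', 'wait.', '']
```

Because text.split() never produces empty strings, word[-1] is safe on the first pass. Running the same comprehension a second time over this result would raise the IndexError from the question, since the last entry is now empty.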
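In your question, however, sonderzeichen contains "...", which contradicts the name of the variable (German for "special characters"). The list comprehension would not work with it, because word[-1] is always a single character and can never equal the three-character string "...".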
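One possible way around both problems is str.rstrip, which is a different technique from the list comprehension shown earlier. It treats its argument as a set of characters and removes every trailing occurrence, so "..." is already covered by ".". It also never indexes into an empty string, which is what triggers the IndexError in the question once a word such as "-" has been shortened to "". The sketch below is only an illustration: the helper name strip_trailing, the sample words, and the decision to drop words that end up empty are my assumptions, not something stated in the question.

```python
sonderzeichen = ["?", ".", "[", "]", "{", "}", "|", "#", "&", "*", "/",
                 ":", ";", " ", "-", "_", "=", "<", ">", "..."]

# rstrip() takes a string of characters to remove, so join the list into one.
# The "..." entry adds nothing new here because "." is already included.
strip_chars = "".join(sonderzeichen)


def strip_trailing(words, chars):
    """Hypothetical helper: strip all trailing characters found in chars.

    Words that consist only of such characters become empty and are dropped
    (an assumption about the desired result).
    """
    stripped = (word.rstrip(chars) for word in words)
    return [word for word in stripped if word]


# Made-up sample input standing in for the scraped word_list.
sample = ["Hello...", "world", "why?", "a-b--", "---", "end."]
cleaned = strip_trailing(sample, strip_chars)
print(cleaned)  # ['Hello', 'world', 'why', 'a-b', 'end']
```

Characters inside a word, such as the hyphen in "a-b", are left alone because rstrip only works from the right end of the string.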