This is the code I use to scrape a particular section of an article on a website:
soup.find("div", {"id": "content_wrapper"}).text
I am supposed to replace each newline ('\n') in the body text with a space (' '). I have done this with: soup.find("div", {"id": "content_wrapper"}).text.replace("\n", " ").strip()
But I still need to replace each '\xa0' and '\u200a' character in the body text with a space (' ') and strip all leading and trailing whitespace.
How do I do this please?
Thank you!
CodePudding user response:
You can simply chain further replace calls after the first one:
text = soup.find('div', {'id': 'content_wrapper'}).text
modified_text = text.replace('\n', ' ').replace('\xa0', ' ').replace('\u200a', ' ').strip()
If you actually want to remove these characters altogether, don't replace them with a space (" "); replace them with an empty string ("") instead. Note that this joins any words that were separated only by one of those characters.
text = soup.find('div', {'id': 'content_wrapper'}).text
modified_text = text.replace('\n', '').replace('\xa0', '').replace('\u200a', '').strip()
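If the list of characters grows, the chained replace calls can get long. One possible alternative is str.translate with a table built by str.maketrans, which handles every listed character in a single pass. This is only a sketch: the names WHITESPACE_TABLE, clean_text and sample are made up for illustration, and the sample string stands in for the text you would get from soup.find(...).text.

```python
# Map each unwanted character to a plain space (names here are illustrative).
WHITESPACE_TABLE = str.maketrans({"\n": " ", "\xa0": " ", "\u200a": " "})


def clean_text(raw):
    """Replace newline, no-break space and hair space with ' ', then strip."""
    return raw.translate(WHITESPACE_TABLE).strip()


# Stand-in for soup.find('div', {'id': 'content_wrapper'}).text
sample = "\n\xa0Hello\u200aworld\xa0again\n"
cleaned = clean_text(sample)
print(cleaned)
```

Each listed character becomes exactly one space, so the result matches what the chained replace version with ' ' would produce.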
CodePudding user response:
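All you need to do is check each character of the text and overwrite the unwanted ones, like this:
text = soup.find('div', {'id': 'content_wrapper'}).text
cleaned = []
for ch in text:
    # Swap newline, no-break space and hair space for a plain space
    if ch in ('\n', '\xa0', '\u200a'):
        ch = ' '
    cleaned.append(ch)
# Join the characters back into one string and trim both ends
result = ''.join(cleaned).strip()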
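If collapsing runs of whitespace is acceptable (the question asks for a one-to-one replacement, so this is a different behaviour), a shorter option might be to split on whitespace and join again. In Python 3, str.split() with no arguments treats '\xa0' and '\u200a' as whitespace as well as '\n'. The function name normalize_whitespace and the sample string below are assumptions for illustration only.

```python
def normalize_whitespace(raw):
    """Collapse every run of whitespace into one space and trim both ends."""
    # split() with no argument splits on any Unicode whitespace,
    # which covers '\n', '\xa0' and '\u200a'.
    return " ".join(raw.split())


# Stand-in for soup.find('div', {'id': 'content_wrapper'}).text
sample = "Line one\n\nLine\xa0two\u200a \u200aend\n"
normalized = normalize_whitespace(sample)
print(normalized)
```

Unlike the replace-based versions, this turns two consecutive newlines into a single space rather than two, so use it only if that is the result you want.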