I am trying to scrape a website that I saved as a local HTML file. When I use the find_all() method, the text of every matching tag is printed in the Python console. The problem is that I can't get all of that text written to a .txt file.
from bs4 import BeautifulSoup

def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')
        for interest in interests:
            with open ('interest.txt', 'w') as file:
                file.write(f'{interest.text}')
                print(interest.text)
Python prints the text of every tag, but the .txt file ends up containing only the last tag.
Edit: I would also like to do a similar thing with a docx file. I took Igor's suggested code and changed the relevant parts to what I would need for a docx file, but I'm still having the same issue with the docx file.
from bs4 import BeautifulSoup
import docx

def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')
        with open('interest.txt', 'w') as file:
            for interest in interests:
                mydoc = docx.Document()
                mydoc.add_paragraph(f'{interest.text}')
                mydoc.save("C:/Users\satam\PycharmProjects\pythonProject\Web Scraper\list.docx")
                print(interest.text)
CodePudding user response:
You reopen the file in write mode in every iteration; this overwrites its previous contents. Either open it just once and place the loop within the with block, or open it with the a mode (a for "append"): open('interest.txt', 'a').
(The former is likely preferable in this case as it seems there's no reason to keep opening and closing the file again and again while you're continuously writing to it.)
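To make the difference concrete, here is a small sketch that compares the three approaches on throwaway files. The helper names (write_in_loop, write_once, read_lines) and the sample headings are made up for illustration; plain strings stand in for interest.text.

```python
import os
import tempfile

def write_in_loop(path, items, mode):
    """Reopen `path` with `mode` once per item, as in the question's loop."""
    for item in items:
        with open(path, mode) as f:
            f.write(item + "\n")

def write_once(path, items):
    """Open `path` a single time and keep the loop inside the with block."""
    with open(path, "w") as f:
        for item in items:
            f.write(item + "\n")

def read_lines(path):
    """Return the file's contents as a list of lines."""
    with open(path) as f:
        return f.read().splitlines()

headings = ["Hiking", "Chess", "Cooking"]  # made-up sample data

with tempfile.TemporaryDirectory() as tmp:
    path_w = os.path.join(tmp, "reopen_w.txt")
    path_a = os.path.join(tmp, "reopen_a.txt")
    path_once = os.path.join(tmp, "once.txt")
    write_in_loop(path_w, headings, "w")   # truncates on every open
    write_in_loop(path_a, headings, "a")   # appends on every open
    write_once(path_once, headings)        # single open, loop inside
    result_w = read_lines(path_w)
    result_a = read_lines(path_a)
    result_once = read_lines(path_once)

print(result_w)     # only the last item survives
print(result_a)     # all items
print(result_once)  # all items
```

Note that append mode also keeps whatever an earlier run of the script left in the file, so running it twice would duplicate the entries. That is another reason the single open in w mode tends to be the better fit here.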
CodePudding user response:
Every iteration rewrites the interest.txt file. You just need to take the with open... part out of the for loop.
Try replacing this fragment
for interest in interests:
    with open ('interest.txt', 'w') as file:
        file.write(f'{interest.text}')
        print(interest.text)
with the following code:
with open('interest.txt', 'w') as file:
    for interest in interests:
        file.write(f'{interest.text}')
        print(interest.text)
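One detail worth checking with this fragment: write() does not add a line break on its own, so the headings end up back to back in the file unless the text itself ends in one. Below is a minimal sketch, assuming one heading per line is what you want. The helper name write_headings and the sample strings are hypothetical, and plain strings stand in for interest.text.

```python
import os
import tempfile

def write_headings(path, headings):
    """Write each heading on its own line, opening the file only once."""
    with open(path, "w", encoding="utf-8") as f:
        for heading in headings:
            # strip() drops surrounding whitespace; "\n" separates entries
            f.write(heading.strip() + "\n")

sample = ["  Hiking ", "Chess", "Cooking\n"]  # stand-ins for interest.text

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "interest.txt")
    write_headings(out_path, sample)
    with open(out_path, encoding="utf-8") as f:
        written = f.read()

print(written)
```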
Here is the complete code:
from bs4 import BeautifulSoup

def interest_retrieval(filename):
    with open(f'{filename}', 'r') as html_file:
        content = html_file.read()
        soup = BeautifulSoup(content, 'lxml')
        interests = soup.find_all('h2')
        with open('interest.txt', 'w') as file:
            for interest in interests:
                file.write(f'{interest.text}')
                print(interest.text)