I'm a complete newbie, but I've managed to scrape EAN numbers with Python from a list of links created by an upstream piece of code. However, my output file contains all the scraped numbers on one continuous line instead of one EAN per line.
Here's my code. What's wrong with it? (scraped URL redacted)
import requests
from bs4 import BeautifulSoup
import urllib.request
import os

subpage = 1
while subpage <= 2:
    URL = "https://..." + str(subpage)
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, "html.parser")

    """writes all links under the h2 tag into a list"""
    links = []
    h2s = soup.find_all("h2")
    for h2 in h2s:
        links.append("http://www.xxxxxxxxxxx.com" + h2.a['href'])

    """opens links from list and extracts EAN number from underlying page"""
    with open("temp.txt", "a") as output:
        for link in links:
            urllib.request.urlopen(link)
            page_2 = requests.get(link)
            soup_2 = BeautifulSoup(page_2.content, "html.parser")
            if "EAN:" in soup_2.text:
                span = soup_2.find(class_="articleData_ean")
                EAN = span.a.text
                output.write(EAN)
    subpage += 1

os.replace('temp.txt', 'EANs.txt')
CodePudding user response:
output.write(EAN) writes each EAN with nothing between them. write() does not add a separator or newline automatically. You can add a newline after each EAN with output.write('\n'), or write a comma or some other separator instead.
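
As a minimal sketch of that fix, the snippet below writes one value per line by appending the newline to each string before writing it. The sample EAN values, the `eans` list, and the temporary file path are placeholders made up for illustration; in the original script the values would come from `span.a.text` inside the loop.

```python
import os
import tempfile

# Hypothetical sample values standing in for the scraped EANs.
eans = ["4006381333931", "5901234123457", "9780201379624"]

# Write to a throwaway directory so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "EANs.txt")

with open(path, "a") as output:
    for ean in eans:
        # write() emits exactly the string it is given,
        # so the newline has to be added explicitly.
        output.write(ean.strip() + "\n")

# Read the file back to confirm there is one EAN per line.
with open(path) as f:
    lines = f.read().splitlines()

print(lines)
```

The strip() call is an assumption that the scraped text may carry stray whitespace. An alternative that adds the newline for you is print(ean, file=output).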