I am new to Python. Two questions:

1. I have a list of URLs I want to scrape data from, but something is wrong with my code: it only scrapes the first URL and skips the rest. How can I successfully scrape the data (title, info, description, application) from every URL in the list?

2. Once question 1 works, how can I write the scraped data to a CSV file?
Here is the code:
import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

urlList = ["url1", "url2", "url3", ...]  # ...list of urls...

for url in urlList:
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
    except URLError:
        print("error")
    else:
        soup = BeautifulSoup(html.read(), "html5lib")

# Scraping
def getTitle():
    for title in soup.find('h2', class_="xx").text:
        print(title)

def getInfo():
    for info in soup.find('ul', class_="j-k-i").text:
        print(info)

def getDescription():
    for description in soup.find('div', class_="b-d").text:
        print(description)

def getApplication():
    for application in soup.find('div', class_="g-b bm-b-30").text:
        print(application)

for soups in soup():
    getTitle()
    getInfo()
    getDescription()
    getApplication()
CodePudding user response:
Try the following kind of approach:
import csv
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

# Each function now takes the current soup and returns the text it finds
def getTitle(soup):
    return soup.find('h2', class_="xx").text

def getInfo(soup):
    return soup.find('ul', class_="j-k-i").text

def getDescription(soup):
    return soup.find('div', class_="b-d").text

def getApplication(soup):
    return soup.find('div', class_="g-b bm-b-30").text

urlList = ["url1", "url2", "url3", ...]  # ...list of urls...

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['Title', 'Info', 'Desc', 'Application'])

    for url in urlList:
        try:
            html = urlopen(url)
        except HTTPError as e:
            print(e)
        except URLError:
            print("error")
        else:
            soup = BeautifulSoup(html.read(), "html5lib")
            row = [getTitle(soup), getInfo(soup), getDescription(soup), getApplication(soup)]
            print(row)
            csv_output.writerow(row)
This passes the current soup to each function to use. Each function now returns the text that is found (previously, the for loop was printing one character at a time).
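To see why the old version printed one character at a time: soup.find(...).text is a single string, and iterating over a string with a for loop yields individual characters. For example:

    for title in "Engineer":
        print(title)  # prints "E", then "n", then "g", ...

Returning the whole string instead gives one value per page, which is what a CSV row needs.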
Lastly, Python's csv library can be used to easily write a correctly formatted CSV file. It takes a list of values for each row and, by default, writes them as a comma-separated row to output.csv.
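Since pandas is already in your imports, an alternative sketch (untested, reusing the helper functions and the same class-name assumptions as above) is to collect the rows into a list and write them all at once:

    import pandas as pd
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError
    from bs4 import BeautifulSoup

    rows = []
    for url in urlList:
        try:
            html = urlopen(url)
        except (HTTPError, URLError) as e:
            print(e)
        else:
            soup = BeautifulSoup(html.read(), "html5lib")
            # Reuses getTitle/getInfo/getDescription/getApplication defined above
            rows.append([getTitle(soup), getInfo(soup), getDescription(soup), getApplication(soup)])

    # DataFrame.to_csv writes the header row and comma-separated values for us
    pd.DataFrame(rows, columns=['Title', 'Info', 'Desc', 'Application']).to_csv('output.csv', index=False)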
Note: not tested as you have not provided any suitable URLs.
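One more caveat: if any of those elements is missing on a page, soup.find() returns None and calling .text on it will raise an AttributeError. If that can happen on your pages, a guarded helper (again untested, same class-name assumptions) looks like:

    def getTitle(soup):
        tag = soup.find('h2', class_="xx")
        # find() returns None when the tag is absent; fall back to an empty string
        return tag.text.strip() if tag else ""

The same pattern applies to the other three functions.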