Scraping a list of urls using beautifulsoup and convert data to csv


I am new to Python. Questions below:

  1. I have a list of URLs I want to scrape data from. I don't know what is wrong with my code: it only scrapes the first URL and not the rest. How can I successfully scrape the data (title, info, description, application) from all the URLs in the list?

  2. If question 1 works, how can I convert the data into a CSV file?

Here is the code:

import requests
from bs4 import BeautifulSoup
from bs4 import BeautifulSoup
import lxml
import pandas as pd
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

urlList = ["url1","url2","url3"...lists of urls.......]

for url in urlList:
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
    except URLError:
        print("error")
    else:
        soup = BeautifulSoup(html.read(),"html5lib")
# Scraping
def getTitle():
    for title in soup.find('h2', class_="xx").text:
        print(title)

def getInfo():
    for info in soup.find('ul', class_="j-k-i").text:
        print(info)

def getDescription():
    for description in soup.find('div', class_="b-d").text:
        print(description)

def getApplication():
    for application in soup.find('div', class_="g-b bm-b-30").text:
        print(application)

for soups in soup():
    getTitle()
    getInfo()
    getDescription()
    getApplication()

CodePudding user response:

Try the following kind of approach:

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
import csv


def getTitle(soup):
    # Job title text from the <h2 class="xx"> element
    return soup.find('h2', class_="xx").text

def getInfo(soup):
    # Summary info from the <ul class="j-k-i"> element
    return soup.find('ul', class_="j-k-i").text

def getDescription(soup):
    # Full description from the <div class="b-d"> element
    return soup.find('div', class_="b-d").text

def getApplication(soup):
    # Application details from the <div class="g-b bm-b-30"> element
    return soup.find('div', class_="g-b bm-b-30").text

urlList = ["url1","url2","url3"...lists of urls.......]

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['Title', 'Info', 'Desc', 'Application'])
    
    for url in urlList:
        try:
            html = urlopen(url)
        except HTTPError as e:
            print(e)
        except URLError:
            print("error")
        else:
            soup = BeautifulSoup(html.read(),"html5lib")
            row = [getTitle(soup), getInfo(soup), getDescription(soup), getApplication(soup)]
            print(row)
            csv_output.writerow(row)
            

This passes the current soup to each function. Each function now returns the text it finds; in the original code, each for loop iterated over the string returned by .text, so print fired once per character.
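
To see that problem in isolation, iterating over a string yields one character per pass (the title string below is made up):

title = "Data Analyst"   # e.g. what soup.find('h2', class_="xx").text returns
for ch in title:         # iterating over a string yields single characters
    print(ch)            # prints D, a, t, a, ... one per line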

Lastly, Python's csv library makes it easy to write a correctly formatted CSV file. csv.writer takes a list of values for each row and, by default, writes them as a comma-separated line to output.csv.
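
For instance, a minimal sketch of the quoting behaviour (demo.csv is just an illustrative file name):

import csv

with open('demo.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Info'])
    writer.writerow(['Engineer', 'Skills: Python, SQL'])  # embedded comma gets quoted automatically

# demo.csv now contains:
# Title,Info
# Engineer,"Skills: Python, SQL"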

Note: not tested as you have not provided any suitable URLs.
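
One caveat, since the pages can't be checked: soup.find() returns None when a selector matches nothing, and calling .text on None raises an AttributeError. A small defensive helper, sketched with the (unverified) class names from the question:

def get_text(soup, tag, class_name):
    # soup.find returns None when nothing matches, so guard before .text
    element = soup.find(tag, class_=class_name)
    return element.text.strip() if element else ''

def getTitle(soup):
    return get_text(soup, 'h2', 'xx')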
