Home > Enterprise >  Reading list of URLs from .csv for scraping with Python, BeautifulSoup, Pandas
Reading list of URLs from .csv for scraping with Python, BeautifulSoup, Pandas

Time:11-28

This was part of another question (Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas ) which was generously answered by @HedgeHog and contributed to by @QHarr. Now posting this part as a separate question.

In the code below, I'm pasting 3 example source URLs into the code and it works. But I have a long list of URLs (1000 ) to scrape and they are stored in a single first column of a .csv file (let's call it 'urls.csv'). I would prefer to read directly from that file.

I think I know the basic structure of 'with open' (e.g. the way @bguest answered it below), but I'm having problems how to link that to the rest of the code, so that the rest continues to work. How can I replace the list of URLs with iterative reading of .csv, so that I'm passing the URLs correctly into the code?

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")


    def get_drivers():
        data.append({
            'url': url,
            'type': 'driver',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })


    get_drivers()


    def get_challenges():
        data.append({
            'url': url,
            'type': 'challenges',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if
                     'Table Impact of drivers and challenges' not in x.get_text(strip=True)]
        })


    get_challenges()

pd.concat([pd.DataFrame(data)[['url', 'type']], pd.DataFrame(pd.DataFrame(data).list.tolist())],
          axis=1).to_csv('output.csv')

CodePudding user response:

Since you're using pandas, read_csv will do the trick for you: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

If you want to write it on your own, you could use the built in csv library

import csv

with open('urls.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row["url"])

Edit: was asked how to make the rest of the code use the urls from the csv.

First put the urls in a urls.csv file

url
https://www.google.com
https://www.facebook.com

Now gather the urls from the csv

import csv

with open('urls.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)

    urls = [row["url"] for row in reader]

# remove the following lines
urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']

Now the urls should be used by the rest of the code

  • Related