Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, pandas

Time:11-28

I got this code to almost work, despite much ignorance. Please help on the home run!

  • Problem 1: INPUT:

I have a long list of URLs (1000) in a single column of a .csv file. I would prefer to read them from that file rather than paste them into the code, as I do below.

  • Problem 2: OUTPUT:

Each source page actually has 3 drivers and 3 challenges. In a separate Python file, the code below finds, prints and saves all 3, but not when I use the dataframe approach shown here (it only saves 2).

  • Problem 3: OUTPUT:

I want the output (both files) to have URLs in column 0, and then drivers (or challenges) in the following columns. But what I've written here (probably the 'drop') makes them not only drop one row but also move across 2 columns.

At the end I'm showing both the inputs and the current & desired output. Sorry for the long question. I'll be very grateful for any help!

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
dataframes = []
dataframes2 = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data = []
        for x in toc.select('li:-soup-contains-own("Market drivers") li'):
            data.append(x.get_text(strip=True))
        df = pd.DataFrame(data, columns=[url])
        dataframes.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes)
        tdata = df2.T
        tdata.to_csv(f'detail-dr.csv', header=True)

    get_drivers()


    def get_challenges():
        data = []
        for y in toc.select('li:-soup-contains-own("Market challenges") li'):
            data.append(y.get_text(strip=True).replace('Table Impact of drivers and challenges', ''))
        df = pd.DataFrame(data, columns=[url])
        dataframes2.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes2)
        tdata = df2.T
        tdata.to_csv(f'detail-ch.csv', header=True)

    get_challenges()

The inputs look like this in each URL. They are just lists:

Market drivers

  • Growing investment in fabs
  • Miniaturization of electronic products
  • Increasing demand for IoT devices

Market challenges

  • Rapid technological changes in semiconductor industry
  • Volatility in semiconductor industry
  • Impact of technology chasm Table Impact of drivers and challenges

My desired output for drivers is:

0 | 1 | 2 | 3
http/.../Global-Induction-Hobs-30196623/ | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/ | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity
http/.../Global-Probe-Card-30196643/ | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices

But instead I get:

0 | 1 | 2 | 3 | 4 | 5 | 6
http/.../Global-Induction-Hobs-30196623/ | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances | | | |
http/.../Global-Human-Capital-Management-30196628/ | | | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity | |
http/.../Global-Probe-Card-30196643/ | | | | | Miniaturization of electronic products | Increasing demand for IoT devices

Answer:

Store your data in a list of dicts and create a data frame from it. Then split each list of drivers / challenges into separate columns and concat them to the url and type columns to get the final data frame.
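
To see why the original code loses a row and shifts values sideways, the effect can be reproduced without any scraping. `drop(0, axis=0)` removes the first item of every list, and `pd.concat` on single-column frames whose column names differ (one URL each) stacks them with NaN padding, so after the transpose each URL's values sit in their own block of columns. The sketch below is an illustration only: `url-a` and `url-b` are placeholder names, and the driver texts are copied from the question.

```python
import pandas as pd

# Placeholder URLs (hypothetical); driver names are taken from the question.
scraped = {
    "url-a": ["Product innovations and new designs",
              "Increasing demand for convenient home appliances with changes in lifestyle patterns",
              "Growing adoption of energy-efficient appliances"],
    "url-b": ["Growing investment in fabs",
              "Miniaturization of electronic products",
              "Increasing demand for IoT devices"],
}

# Original approach: one single-column frame per URL, first row dropped,
# frames stacked with concat, then transposed.
frames = [pd.DataFrame(items, columns=[url]).drop(0, axis=0)
          for url, items in scraped.items()]
shifted = pd.concat(frames).T
print(shifted)  # each URL fills its own pair of columns, the rest is NaN

# List-of-dicts approach: one row per URL, list expanded into columns.
records = [{"url": url, "list": items} for url, items in scraped.items()]
df = pd.DataFrame(records)
wide = pd.concat([df[["url"]], pd.DataFrame(df["list"].tolist())], axis=1)
print(wide)  # url column followed by one column per driver
```

Keeping one row per URL, with all rows sharing the same item columns, is what keeps the values aligned.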

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url':url,
            'type':'driver',
            'list':[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })

    get_drivers()


    def get_challenges():
        data.append({
            'url':url,
            'type':'challenges',
            'list':[x.text.replace('Table Impact of drivers and challenges','') for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if x.text != 'Table Impact of drivers and challenges']
        })

    get_challenges()

    
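# Expand each list into its own columns and place them next to url / type
df = pd.DataFrame(data)
result = pd.concat([df[['url', 'type']], pd.DataFrame(df['list'].tolist())], axis=1)
print(result)  # or result.to_csv(sep='|') to get pipe-delimited text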

Output

url | type | 0 | 1 | 2
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | driver | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | challenges | High cost limiting the adoption in the mass segment | Health hazards related to induction hobs | Limitation of using only flat - surface utensils and induction-specific cookware
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | driver | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | challenges | Threat from open-source software | High implementation and maintenance cost | Threat to data security
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | driver | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | challenges | Rapid technological changes in semiconductor industry | Volatility in semiconductor industry | Impact of technology chasm
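
For Problem 1, the hardcoded `urls` list could be replaced by reading the .csv file. This is a minimal sketch, assuming the file has no header row and holds one URL per line. `load_urls`, `wide_table`, the example.com URLs and the file name `urls.csv` are hypothetical names for illustration, and the records are stand-ins shaped like the `data` list built in the example above. The second helper shows one way to get a separate table per type with the URL in the first column, as asked in Problem 3.

```python
import io
import pandas as pd

def load_urls(source):
    """Return the first column of a header-less CSV as a list of URL strings."""
    column = pd.read_csv(source, header=None, usecols=[0]).iloc[:, 0]
    return column.dropna().astype(str).str.strip().tolist()

def wide_table(records, kind):
    """Keep the rows of one type: URL first, then one column per list item."""
    rows = [r for r in records if r["type"] == kind]
    df = pd.DataFrame(rows).reset_index(drop=True)
    items = pd.DataFrame(df["list"].tolist())
    return pd.concat([df[["url"]], items], axis=1)

# Stand-in for a file such as urls.csv (hypothetical name), one URL per line.
sample = io.StringIO("https://example.com/a/\nhttps://example.com/b/\n")
urls = load_urls(sample)

# In the real script these records would come from get_drivers() / get_challenges().
records = [{"url": u, "type": t, "list": [f"{t} 1", f"{t} 2", f"{t} 3"]}
           for u in urls for t in ("driver", "challenges")]

drivers = wide_table(records, "driver")
challenges = wide_table(records, "challenges")
print(drivers)
# drivers.to_csv("detail-dr.csv", index=False)
# challenges.to_csv("detail-ch.csv", index=False)
```

With a real file, `load_urls("urls.csv")` would take the place of the `io.StringIO` stand-in. If the file does have a header row, `header=None` would need to be dropped so the header is not treated as a URL.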