BeautifulSoup: IndexError: list index out of range for Multiple Links to Scrape

I am working on a scraper that works well on multiple URLs: it takes some basic info and the social links associated with each profile. However, each URL has a different number of links (e.g. one URL can have 3 social links, another only 1, another 2), so as soon as the scraper encounters a URL without enough social links, it fails with:

social2 = links[1]
IndexError: list index out of range

I am pretty sure I need an "if" that tells the scraper to still scrape all the other info but to skip a column when the data (in this case, a social link) is not present, so that if a URL has only 1 or 2 links but not all the ones I am indexing, the scraper writes "None" for each missing link.

Also, I'd like the scraper to keep looking for social links until a URL has no more, because I don't know the maximum number of links each URL can actually have.
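Something like this is roughly what I have in mind (an untested sketch, checking each index before using it):

social1 = links[0].get('href') if len(links) > 0 else None
social2 = links[1].get('href') if len(links) > 1 else None
social3 = links[2].get('href') if len(links) > 2 else None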

I'd like to see something like this in my CSV:

Name,Location,Link,Link 2,Link 3
Mark Red,Los Angeles,https://instagram.com/markred,None,https://tiktok.com/@markred
Mary Green,New York,https://instagram.com/marygreen,https://youtuebe.com/marygreen,None

My code is:

from bs4 import BeautifulSoup
import requests
from csv import writer

urls = ['https://url.com/1','https://url.com/2', 'https://url.com/3']

with open('multi.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Name', 'Location', 'Link', 'Link2', 'Link3']
    thewriter.writerow(header)

    for url in urls:
        my_url = requests.get(url)
        html = my_url.content
        soup = BeautifulSoup(html,'html.parser')

        info = []

        lists = soup.find_all('div', class_="profile-info-holder")

        for l in lists:
            name = l.find('div', class_="profile-name").text
            location = l.find('div', class_="profile-location").text
            links = l.find_all('a', class_="intercept", href=True)
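            # the next lines fail with "IndexError: list index out of range"
            # whenever a profile has fewer than 3 links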
            social1 = links[0]
            social2 = links[1]
            social3 = links[2]
        
            info = [name, location, social1.get('href'), social2.get('href'), social3.get('href')]
            thewriter.writerow(info)

Thank you!

CodePudding user response:

An example of how to parse the HTML for a varying number of social links and save the result in a pandas DataFrame:

import re
import pandas as pd
from bs4 import BeautifulSoup

html_doc = """
    <div class="profile-info-holder">
        <div class="profile-name">Mark Red</div>
        <div class="profile-location">Los Angeles</div>
        <a class="intercept" href="https://instagram.com/markred">https://instagram.com/markred</a>
        <a class="intercept" href="https://tiktok.com/@markred">https://tiktok.com/@markred</a>
    </div>

    <div class="profile-info-holder">
        <div class="profile-name">Mary Green</div>
        <div class="profile-location">New York</div>
        <a class="intercept" href="https://instagram.com/marygreen">https://instagram.com/marygreen</a>
        <a class="intercept" href="https://youtuebe.com/marygreen">https://youtuebe.com/marygreen</a>
    </div> 
"""

all_info = []

soup = BeautifulSoup(html_doc, "html.parser")

lists = soup.find_all("div", class_="profile-info-holder")

for l in lists:
    name = l.find("div", class_="profile-name").text
    location = l.find("div", class_="profile-location").text
    links = l.find_all("a", class_="intercept", href=True)

    all_info.append(
        {
            "Name": name,
            "Location": location,
            **{
                re.search(r"https?://([^/]+)", a["href"]).group(1): a["href"]
                for a in links
            },
        }
    )

df = pd.DataFrame(all_info)
# df.to_csv('data.csv', index=False)
print(df)

Prints:

         Name     Location                    instagram.com                   tiktok.com                    youtuebe.com
0    Mark Red  Los Angeles    https://instagram.com/markred  https://tiktok.com/@markred                             NaN
1  Mary Green     New York  https://instagram.com/marygreen                          NaN  https://youtuebe.com/marygreen
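
Keying each href by its domain gives every profile a consistent set of columns no matter how many links it has, and pandas fills the gaps with NaN. If you'd rather have the fixed Name, Location, Link, Link2, Link3 layout from your question, here is a sketch of the same loop with padding (it reuses lists from above and writes with csv.writer like your original code; the "None" strings and column names are just there to match your example):

from csv import writer

rows = []
for l in lists:
    name = l.find("div", class_="profile-name").text
    location = l.find("div", class_="profile-location").text
    hrefs = [a["href"] for a in l.find_all("a", class_="intercept", href=True)]
    rows.append((name, location, hrefs))

# the widest profile decides how many Link columns the CSV gets
max_links = max(len(hrefs) for _, _, hrefs in rows)

with open("multi.csv", "w", encoding="utf8", newline="") as f:
    thewriter = writer(f)
    header = ["Name", "Location"] + [
        "Link" if i == 0 else f"Link{i + 1}" for i in range(max_links)
    ]
    thewriter.writerow(header)
    for name, location, hrefs in rows:
        # pad short rows with "None" so every row has the same width
        thewriter.writerow([name, location] + hrefs + ["None"] * (max_links - len(hrefs)))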