I am working on a scraper that works well across multiple URLs: it takes some basic info plus the social links associated with each profile. The problem is that each URL has a different number of links (e.g. one URL can have 3 social links, another URL only 1, another 2). Whenever I run it, as soon as the scraper encounters a URL without enough social links, it fails with: "social2 = links[1] IndexError: list index out of range"
I am pretty sure I need an "if" that tells the scraper to still scrape all the other info but to write "None" for a column when the data (in this case, social links) is missing or there are fewer than 2, 3, 4, etc., so that a URL with only 1 or 2 links, but not all the ones I am calling, doesn't break the run.
I'd also like the scraper to keep collecting social links until the URL has no more, because I don't know the maximum number of links each URL can actually have.
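Something like this is the kind of guard I mean (sketched with a made-up links list of length 1, not my real data):

```python
# Made-up example: this profile only has one social link
links = ["https://instagram.com/markred"]

# Index only when the element exists, otherwise fall back to None
social1 = links[0] if len(links) > 0 else None
social2 = links[1] if len(links) > 1 else None
social3 = links[2] if len(links) > 2 else None
```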
I'd like to see something like this in my CSV:
Name | Location | Link | Link 2 | Link 3 |
---|---|---|---|---|
Mark Red | Los Angeles | https://instagram.com/markred | None | https://tiktok.com/@markred |
Mary Green | New York | https://instagram.com/marygreen | https://youtuebe.com/marygreen | None |
My code is:
from bs4 import BeautifulSoup
import requests
from csv import writer

urls = ['https://url.com/1', 'https://url.com/2', 'https://url.com/3']

with open('multi.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Name', 'Location', 'Link', 'Link2', 'Link3']
    thewriter.writerow(header)
    for url in urls:
        my_url = requests.get(url)
        html = my_url.content
        soup = BeautifulSoup(html, 'html.parser')
        lists = soup.find_all('div', class_="profile-info-holder")
        for l in lists:
            name = l.find('div', class_="profile-name").text
            location = l.find('div', class_="profile-location").text
            links = l.find_all('a', class_="intercept", href=True)
            social1 = links[0]
            social2 = links[1]
            social3 = links[2]
            info = [name, location, social1.get('href'), social2.get('href'), social3.get('href')]
            thewriter.writerow(info)
Thank you!
CodePudding user response:
Here is an example of how to parse the HTML for a varying number of social links and save the result in a pandas DataFrame (pandas fills the missing columns with NaN automatically):
import re
import pandas as pd
from bs4 import BeautifulSoup
html_doc = """
<div class="profile-info-holder">
<div class="profile-name">Mark Red</div>
<div class="profile-location">Los Angeles</div>
<a class="intercept" href="https://instagram.com/markred">https://instagram.com/markred</a>
<a class="intercept" href="https://tiktok.com/@markred">https://tiktok.com/@markred</a>
</div>
<div class="profile-info-holder">
<div class="profile-name">Mary Green</div>
<div class="profile-location">New York</div>
<a class="intercept" href="https://instagram.com/marygreen">https://instagram.com/marygreen</a>
<a class="intercept" href="https://youtuebe.com/marygreen">https://youtuebe.com/marygreen</a>
</div>
"""
all_info = []
soup = BeautifulSoup(html_doc, "html.parser")
lists = soup.find_all("div", class_="profile-info-holder")
for l in lists:
    name = l.find("div", class_="profile-name").text
    location = l.find("div", class_="profile-location").text
    links = l.find_all("a", class_="intercept", href=True)
    all_info.append(
        {
            "Name": name,
            "Location": location,
            **{
                re.search(r"https?://([^/]+)", a["href"]).group(1): a["href"]
                for a in links
            },
        }
    )
df = pd.DataFrame(all_info)
# df.to_csv('data.csv', index=False)
print(df)
Prints:
Name Location instagram.com tiktok.com youtuebe.com
0 Mark Red Los Angeles https://instagram.com/markred https://tiktok.com/@markred NaN
1 Mary Green New York https://instagram.com/marygreen NaN https://youtuebe.com/marygreen
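If you'd rather keep the csv.writer approach and the Name | Location | Link1 | Link2 | ... layout from the question, a minimal sketch of the same idea (using made-up sample markup where Mary Green deliberately has only one link; the widest profile decides how many Link columns the CSV gets):

```python
import csv
from bs4 import BeautifulSoup

# Made-up sample markup: Mark Red has two links, Mary Green only one
html_doc = """
<div class="profile-info-holder">
  <div class="profile-name">Mark Red</div>
  <div class="profile-location">Los Angeles</div>
  <a class="intercept" href="https://instagram.com/markred">link</a>
  <a class="intercept" href="https://tiktok.com/@markred">link</a>
</div>
<div class="profile-info-holder">
  <div class="profile-name">Mary Green</div>
  <div class="profile-location">New York</div>
  <a class="intercept" href="https://instagram.com/marygreen">link</a>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
rows = []
for holder in soup.find_all("div", class_="profile-info-holder"):
    name = holder.find("div", class_="profile-name").text
    location = holder.find("div", class_="profile-location").text
    # Collect however many links this profile actually has
    hrefs = [a["href"] for a in holder.find_all("a", class_="intercept", href=True)]
    rows.append([name, location] + hrefs)

# The profile with the most links decides how many Link columns we need
max_links = max(len(r) - 2 for r in rows)
header = ["Name", "Location"] + [f"Link{i + 1}" for i in range(max_links)]
# Pad shorter rows with None so every row has the same width
padded = [r + [None] * (len(header) - len(r)) for r in rows]

with open("multi.csv", "w", encoding="utf8", newline="") as f:
    w = csv.writer(f)
    w.writerow(header)
    w.writerows(padded)
```

One caveat: csv.writer writes a None cell as an empty field; if you literally want the text "None" in the CSV, pad with the string `"None"` instead.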