Web scraping of coreyms.com

Time:03-01

When I scrape the posts on coreyms.com with BeautifulSoup (the heading, date, content, and YouTube link of each post), I run into this problem: every post except one contains a YouTube link. So after scraping, len(videolink) is 9, while len(heading), len(date), and len(content) are all 10. How can I make len(videolink) equal 10 by inserting NaN for the post that has no YouTube link?

The code is given for reference:

from bs4 import BeautifulSoup
import requests
page7 = requests.get('https://coreyms.com/')
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup7 = BeautifulSoup(page7.content, 'html.parser')

heading=[]

for i in soup7.find_all('h2',class_='entry-title'):
    heading.append(i.text)
    
heading

date=[]

for i in soup7.find_all('time',class_='entry-time'):
    date.append(i.text)
    
date

content=[]

for i in soup7.find_all('div',class_='entry-content'):
    content.append(i.text)
    
content

videolink=[]

for i in soup7.find_all('iframe',class_='youtube-player'):
    videolink.append(i['src'])
    
videolink

print(len(heading),len(date),len(content),len(videolink))

Answer:

Rethink how you process the data and move away from this plethora of separate lists. Instead, store the data in a structured form such as a dict or a list of dicts (a structure that can also be turned into a DataFrame easily).

Just iterate over all articles and check whether the required information is available; if it is not, set its value to None or whatever placeholder you prefer:

data = []

for a in soup7.find_all('article'):
    data.append({
        'heading': a.h2.text,
        'date': a.find('time', class_='entry-time').text,
        'content': a.find('div', class_='entry-content').text,
        # The walrus operator (:=) needs Python 3.8+; None marks a post without a video
        'videolink': vl['src'] if (vl := a.find('iframe', class_='youtube-player')) else None
    })
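
The per-article loop can also be tried offline. The following is only a sketch under stated assumptions: the HTML string is hypothetical markup that imitates the tags and classes used above (it is not copied from coreyms.com), and the titles, dates, and VIDEO_ID are placeholders. It shows that an article without an iframe still produces a record, with None in the videolink field:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure the loop relies on;
# the second article deliberately has no YouTube iframe.
html = """
<article>
  <h2 class="entry-title">First post</h2>
  <time class="entry-time">January 1, 2020</time>
  <div class="entry-content">Has a video.</div>
  <iframe class="youtube-player" src="https://www.youtube.com/embed/VIDEO_ID"></iframe>
</article>
<article>
  <h2 class="entry-title">Second post</h2>
  <time class="entry-time">January 2, 2020</time>
  <div class="entry-content">No video here.</div>
</article>
"""

soup = BeautifulSoup(html, 'html.parser')

data = []
for a in soup.find_all('article'):
    # find() returns None when the article contains no matching iframe
    vl = a.find('iframe', class_='youtube-player')
    data.append({
        'heading': a.h2.text,
        'date': a.find('time', class_='entry-time').text,
        'content': a.find('div', class_='entry-content').text,
        'videolink': vl['src'] if vl is not None else None,
    })

for row in data:
    print(row['heading'], '->', row['videolink'])
```

Because every lookup is scoped to a single article, each field stays attached to the post it belongs to, which is what the separate page-wide find_all calls could not guarantee.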

Example

from bs4 import BeautifulSoup
import requests
page7 = requests.get('https://coreyms.com/')
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup7 = BeautifulSoup(page7.content, 'html.parser')

data = []

for a in soup7.find_all('article'):
    data.append({
        'heading': a.h2.text,
        'date': a.find('time', class_='entry-time').text,
        'content': a.find('div', class_='entry-content').text,
        # The walrus operator (:=) needs Python 3.8+; None marks a post without a video
        'videolink': vl['src'] if (vl := a.find('iframe', class_='youtube-player')) else None
    })

print(data)

Output

[{'heading': 'Python Tutorial: Zip Files – Creating and Extracting Zip Archives', 'date': 'November 19, 2019', 'content': '\nIn this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…\n\n', 'videolink': 'https://www.youtube.com/embed/z0gguhEmWiY?version=3&rel=1&showsearch=0&showinfo=1&iv_load_policy=1&fs=1&hl=en-US&autohide=2&wmode=transparent'}, {'heading': 'Python Data Science Tutorial: Analyzing the 2019 Stack Overflow Developer Survey', 'date': 'October 17, 2019', 'content': '\nIn this Python Programming video, we will be learning how to download and analyze real-world data from the 2019 Stack Overflow Developer Survey. This is terrific practice for anyone getting into the data science field. We will learn different ways to analyze this data and also some best practices. Let’s get started…\n\n\n\n', 'videolink': 'https://www.youtube.com/embed/_P7X8tMplsw?version=3&rel=1&showsearch=0&showinfo=1&iv_load_policy=1&fs=1&hl=en-US&autohide=2&wmode=transparent'}, {'heading': 'Python Multiprocessing Tutorial: Run Code in Parallel Using the Multiprocessing Module', 'date': 'September 21, 2019', 'content': '\nIn this Python Programming video, we will be learning how to run code in parallel using the multiprocessing module. We will also look at how to process multiple high-resolution images at the same time using a ProcessPoolExecutor from the concurrent.futures module. Let’s get started…\n\n\n\n', 'videolink': 'https://www.youtube.com/embed/fKl2JW_qrso?version=3&rel=1&showsearch=0&showinfo=1&iv_load_policy=1&fs=1&hl=en-US&autohide=2&wmode=transparent'},...]
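
If separate lists of equal length with NaN in the gaps are still wanted, as the question asks, they can be derived from the list of dicts. This is a minimal sketch, assuming a small hand-written sample in place of the scraped result (the headings, dates, and VIDEO_ID_A are placeholders) and using math.nan from the standard library as the NaN value:

```python
import math

# Hypothetical sample shaped like the scraped records; values are placeholders.
data = [
    {'heading': 'Post A', 'date': 'January 1, 2020', 'content': 'text A',
     'videolink': 'https://www.youtube.com/embed/VIDEO_ID_A'},
    {'heading': 'Post B', 'date': 'January 2, 2020', 'content': 'text B',
     'videolink': None},
]

# One list per field; all of them have exactly one entry per post.
heading = [d['heading'] for d in data]
date = [d['date'] for d in data]
content = [d['content'] for d in data]
# Swap the None placeholder for NaN where a post has no video.
videolink = [math.nan if d['videolink'] is None else d['videolink'] for d in data]

print(len(heading), len(date), len(content), len(videolink))
```

All four lengths are equal here because each list is built from the same records, so a missing link occupies a slot instead of shifting the later links up by one.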