Unable to get data from <li> _data_ </li> using Python while building a web scraper

Time:07-08

This is a follow-up to a question I asked earlier. I got a very good answer, but I didn't fully understand the code. Please help me scrape information from the following websites.

  1. https://premieragile.com/csm-training/
  2. https://www.simplilearn.com/agile-and-scrum/csm-certification-training

Here I want all the information given in each card. I am also adding the program I am using, which I got from Stack Overflow.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://premieragile.com/csm-training/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for row in soup.select(".row > schedules-courses br-10 h-100 p-3 p-sm-4"):
    date = row.findAll(".d-flex align-items-center pb-4 h6").text.strip()
#     year = row.select_one(".li .batchDetails .date-details .date span").text.strip()
#     rating = row.select_one(".imdbRating").text.strip()
    # ...other variables

    all_data.append([date])


df = pd.DataFrame(all_data, columns=["date"])
print(df.head().to_markdown(index=False))

Please explain how I should add the div class in the 'for loop', and what the hierarchy of the following should be:

  1. div
  2. li
  3. h
  4. ul
  5. li

Please help me understand this. I get the general idea that we create an empty list and add data to it using the BeautifulSoup object. I am utterly confused about how I should study the website I want to scrape, and thus how to add columns to the rows in the program.

P.S. I'm getting blank output.

CodePudding user response:

The content is loaded dynamically from another resource. It is not contained in your soup, which is why you get an empty output.
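A quick way to confirm this is to count how many elements your selector matches in the HTML that requests actually receives. The sketch below is only an illustration: static_html is a made-up stand-in for the served page, and count_matches is a hypothetical helper, not part of any library.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page as served to requests: the container is
# present, but the cards are only added later by JavaScript in the browser.
static_html = """
<div class="row" id="schedule">
  <p>Loading schedules...</p>
</div>
"""

def count_matches(html, selector):
    """Return how many elements in html match the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selector))

cards_found = count_matches(static_html, ".loop")
print(cards_found)  # 0: nothing to loop over, hence the empty DataFrame
```

If the count is 0 for the downloaded page but the cards are visible in the browser, the data is most likely fetched by a separate request, which you can usually spot in the network tab of the browser's developer tools.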

Simply load it from this resource https://premieragile.com/csm-training/?page=1&id=ol&city=&countryCode=DE&trainerid=undefined&timezone=Europe/Berlin and adjust the parameters to your needs.

url = "https://premieragile.com/csm-training/?page=1&id=ol&city=&countryCode=DE&trainerid=undefined&timezone=Europe/Berlin"

The HTML is wrapped in a JSON structure, so you have to specify the path from which the BeautifulSoup object should be created:

r = requests.get(url, headers={'x-requested-with': 'XMLHttpRequest'}).json()['html']
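To see what that line does without hitting the network, here is a minimal offline sketch. The payload is invented for illustration and only assumes the same shape as described above (a JSON object with an 'html' key); the trainer name and date are placeholders, not real data.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical payload shaped like the response described above:
# a JSON object whose "html" key holds a markup fragment.
payload = json.dumps({
    "html": '<div class="loop"><h6> Jane Doe </h6>'
            '<ul><li>01 Jan - 02 Jan - 2030</li></ul></div>'
})

data = json.loads(payload)   # what response.json() does for you
fragment = data["html"]      # a plain string that contains HTML
soup = BeautifulSoup(fragment, "html.parser")

card = soup.select_one(".loop")
trainer = card.h6.get_text(strip=True)
date = card.li.get_text(strip=True)
print(trainer, "|", date)
```

The point is that .json() gives you a Python dict, and only the string stored under 'html' is passed to BeautifulSoup.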

Example

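import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://premieragile.com/csm-training/?page=1&id=ol&city=&countryCode=DE&trainerid=undefined&timezone=Europe/Berlin"

# The endpoint returns JSON; the markup for the cards is under the 'html' key.
r = requests.get(url, headers={'x-requested-with': 'XMLHttpRequest'}).json()['html']

# Name the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup(r, "html.parser")

all_data = []
for e in soup.select('.loop'):  # one .loop element per course card
    all_data.append({
        'trainer': e.h6.text.strip(),
        # collapse the multi-line date text into a single line
        'date': ' '.join(s.strip() for s in e.li.text.split('\n'))
    })

df = pd.DataFrame(all_data)
# to_markdown() requires the optional 'tabulate' package
print(df.head().to_markdown(index=False))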

Output

| trainer            | date                   |
|:-------------------|:-----------------------|
| Daniel James Gullo | 08 Jul - 08 Jul - 2022 |
| Raj Kasturi        | 11 Jul - 13 Jul - 2022 |
| Michel Goldenberg  | 11 Jul - 12 Jul - 2022 |
| Valerio Zanini     | 12 Jul - 14 Jul - 2022 |
| Michael Franken    | 13 Jul - 15 Jul - 2022 |
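On the question of how to write the div class and the div / ul / li hierarchy in the loop: in a CSS selector, several classes on the same element are chained with dots and no spaces, a space means "somewhere inside", and ">" means "direct child". Also note that select() and findAll() return a list of elements, so .text has to be read from each item. The sketch below uses invented markup (the class names are borrowed from the selector in the question, the content is a placeholder), so treat it as an illustration rather than the real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical card markup; class names borrowed from the question's selector.
html = """
<div class="row">
  <div class="schedules-courses br-10 h-100">
    <h6>Jane Doe</h6>
    <ul>
      <li>01 Jan - 02 Jan - 2030</li>
      <li>Online</li>
    </ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Space-separated words are read as nested tag names, so this finds nothing.
wrong = soup.select(".row > schedules-courses br-10 h-100")

# Chain classes on one element with dots; ">" means "direct child of".
cards = soup.select(".row > .schedules-courses.br-10.h-100")

rows = []
for card in cards:
    rows.append({
        "trainer": card.select_one("h6").get_text(strip=True),
        # select() returns a list, so read the text of each item in turn
        "details": [li.get_text(strip=True) for li in card.select("ul > li")],
    })
print(len(wrong), rows)
```

To work out the hierarchy for a real page, inspect one card in the browser's developer tools, pick the outermost element that repeats once per card for the loop, and then select the inner tags relative to that element.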