This is a follow-up to a question I asked earlier. I got a very good answer, but I did not fully understand the code. Please help me scrape information from the following websites.
- https://premieragile.com/csm-training/
- https://www.simplilearn.com/agile-and-scrum/csm-certification-training
I want all the information shown on each card. Below is the program I am using, which I got from Stack Overflow.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://premieragile.com/csm-training/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
all_data = []
for row in soup.select(".row > schedules-courses br-10 h-100 p-3 p-sm-4"):
    date = row.findAll(".d-flex align-items-center pb-4 h6").text.strip()
    # year = row.select_one(".li .batchDetails .date-details .date span").text.strip()
    # rating = row.select_one(".imdbRating").text.strip()
    # ...other variables
    all_data.append([date])
df = pd.DataFrame(all_data, columns=["date"])
print(df.head().to_markdown(index=False))
Please explain how I should specify the div class in the 'for loop', and what the hierarchy of the following elements should be:
- div
- li
- h
- ul
- li
Please help me understand this. I have the general idea that we create an empty list and add data to it using the BeautifulSoup object. I am utterly confused about how to study the website I want to scrape, and therefore how to add a column to each row in the program.
P.S. I am getting blank output.
CodePudding user response:
The content is loaded dynamically from another resource. It is not contained in your soup, which is why you get empty output.
Simply load it from this resource https://premieragile.com/csm-training/?page=1&id=ol&city=&countryCode=DE&trainerid=undefined&timezone=Europe/Berlin and adjust the parameters to your needs.
url = "https://premieragile.com/csm-training/?page=1&id=ol&city=&countryCode=DE&trainerid=undefined&timezone=Europe/Berlin"
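The query string can also be built from a dictionary, which makes individual parameters easier to adjust. This is a minimal sketch using only the standard library; the parameter names and values are copied from the URL above, and what each one means to the server is an assumption.

```python
from urllib.parse import urlencode

base_url = "https://premieragile.com/csm-training/"

# Names and values are taken from the URL above; their exact meaning
# on the server side is assumed, not documented.
params = {
    "page": 1,
    "id": "ol",
    "city": "",
    "countryCode": "DE",
    "trainerid": "undefined",
    "timezone": "Europe/Berlin",
}

# safe="/" leaves the slash in "Europe/Berlin" unescaped, as in the URL above.
url = f"{base_url}?{urlencode(params, safe='/')}"
print(url)
```

The same dictionary can be passed as `requests.get(base_url, params=params, ...)`. In that case requests typically percent-encodes the slash as `%2F`, which servers normally decode to the same value.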
The HTML is wrapped in a JSON structure, so you have to specify the key from which the BeautifulSoup object should be created:
r = requests.get(url, headers={'x-requested-with': 'XMLHttpRequest'}).json()['html']
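To see this unwrapping step in isolation, without a network request, here is a minimal sketch. The payload is made up for illustration; only the `html` key is taken from the line above, and the markup inside it is hypothetical.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical response body: a JSON object whose "html" key holds markup.
payload = json.dumps({"html": '<div class="loop"><h6> Jane Doe </h6></div>'})

data = json.loads(payload)   # equivalent to what response.json() returns
fragment = data["html"]      # a plain string containing HTML
soup = BeautifulSoup(fragment, "html.parser")

# The HTML string is now a normal soup and can be queried as usual.
trainer = soup.select_one(".loop h6").get_text(strip=True)
print(trainer)
```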
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://premieragile.com/csm-training/?page=1&id=ol&city=&countryCode=DE&trainerid=undefined&timezone=Europe/Berlin"
# The endpoint returns JSON; the markup is stored under the 'html' key
r = requests.get(url, headers={'x-requested-with': 'XMLHttpRequest'}).json()['html']
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(r, 'html.parser')

all_data = []
# Each course card is an element with the class 'loop'
for e in soup.select('.loop'):
    all_data.append({
        'trainer': e.h6.text.strip(),
        # Strip each line of the date text and join the lines with spaces
        'date': ' '.join(s.strip() for s in e.li.text.split('\n'))
    })

df = pd.DataFrame(all_data)
print(df.head().to_markdown(index=False))
Output
| trainer | date |
|---|---|
| Daniel James Gullo | 08 Jul - 08 Jul - 2022 |
| Raj Kasturi | 11 Jul - 13 Jul - 2022 |
| Michel Goldenberg | 11 Jul - 12 Jul - 2022 |
| Valerio Zanini | 12 Jul - 14 Jul - 2022 |
| Michael Franken | 13 Jul - 15 Jul - 2022 |
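On how to write classes and hierarchy in a selector: several classes on the same element are chained with dots and no spaces (`.a.b.c`), a space means "any descendant", and `>` means "direct child". A rough sketch on invented markup follows; the class names are borrowed from the question, but the nesting and the text values are assumptions and the real page may differ.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration; class names come from the question,
# the structure and values are assumed.
html = """
<div class="row">
  <div class="schedules-courses br-10 h-100 p-3 p-sm-4">
    <div class="d-flex align-items-center pb-4"><h6>Jane Doe</h6></div>
    <ul><li>11 Jul - 13 Jul - 2022</li></ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
# Direct child of .row that carries both classes (dots, no spaces).
for card in soup.select(".row > .schedules-courses.br-10"):
    rows.append({
        # Space = descendant: an h6 anywhere inside .d-flex.pb-4
        "trainer": card.select_one(".d-flex.pb-4 h6").get_text(strip=True),
        # ">" = direct child: an li directly under a ul
        "date": card.select_one("ul > li").get_text(strip=True),
    })
print(rows)
```

Note that `find_all` takes tag names and attributes rather than CSS selectors and returns a list, so `.text` has to be read from the individual elements; `select` and `select_one` are the methods that accept selectors like the ones above.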