The current code scrapes the individual fields, but I would like to map each time to its titles.
Since the webpage does not put the times and titles in the same class, how can this mapping be done?
Piggy-backing off this question - Link (my question uses an example where the times and titles are not of equal length, so they cannot simply be paired by position).
Website for reference: https://ash.confex.com/ash/2021/webprogram/WALKS.html
Sample Expected Output:
5:00 PM-6:00 PM, ASH Poster Walk on Geriatric Hematology: Selecting the Right Treatment for the Patient, Not Just the Disease
5:00 PM-6:00 PM, ASH Poster Walk on Healthcare Quality Improvement
etc
import requests
from bs4 import BeautifulSoup
url = 'https://ash.confex.com/ash/2021/webprogram/WALKS.html'
res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')
productlist = soup.select('div.itemtitle > a')
times = soup.select('.time')
CodePudding user response:
This could be an alternative:
import requests
from bs4 import BeautifulSoup
url = 'https://ash.confex.com/ash/2021/webprogram/WALKS.html'
res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')
productlist = soup.select('div.itemtitle > a')
#times = soup.select('.time')
for a in productlist:
    title = a.text
    # The nearest preceding h3 holds the time, the nearest preceding h4 the date
    time = a.find_previous('h3').text
    date = a.find_previous('h4').text
    print(title, date, time, end = "\n")
OUTPUT
ASH Poster Walk on What's Hot in Sickle Cell Disease
Wednesday, December 15, 2021
10:00 AM-11:00 AM
ASH Poster Walk on Geriatric Hematology: Selecting the Right Treatment for the Patient, Not Just the Disease
Wednesday, December 15, 2021
5:00 PM-6:00 PM
ASH Poster Walk on Healthcare Quality Improvement
Wednesday, December 15, 2021
5:00 PM-6:00 PM
ASH Poster Walk on Natural Killer Cell-Based Immunotherapy
Wednesday, December 15, 2021
5:00 PM-6:00 PM
ASH Poster Walk on Pediatric Non-malignant Hematology Highlights
Wednesday, December 15, 2021
5:00 PM-6:00 PM
ASH Poster Walk on Clinical Trials In Progress
Thursday, December 16, 2021
10:00 AM-11:00 AM
ASH Poster Walk on Financial Toxicity in Hematologic Malignancies
Thursday, December 16, 2021
10:00 AM-11:00 AM
ASH Poster Walk on Diversity, Equity, and Inclusion in Hematologic Malignancies and Cell Therapy
Thursday, December 16, 2021
5:00 PM-6:00 PM
ASH Poster Walk on Emerging Research in Immunotherapies
Thursday, December 16, 2021
5:00 PM-6:00 PM
ASH Poster Walk on the Spectrum of Hemostasis and Thrombosis Research
Thursday, December 16, 2021
5:00 PM-6:00 PM
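For anyone who wants to try the find_previous idea without hitting the live site, here is a minimal self-contained sketch. SAMPLE_HTML is made-up markup that only imitates the layout the answer relies on (an h4 date, then h3 time slots, each followed by one or more div.itemtitle links); the real page may differ. The function name map_titles and the "Walk A" style titles are hypothetical as well.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the assumed layout of the page:
# an h4 date, then h3 time slots, each followed by one or more titles.
SAMPLE_HTML = """
<div class="content">
  <h4 class="date">Wednesday, December 15, 2021</h4>
  <h3 class="time">10:00 AM-11:00 AM</h3>
  <div class="itemtitle"><a href="#">Walk A</a></div>
  <h3 class="time">5:00 PM-6:00 PM</h3>
  <div class="itemtitle"><a href="#">Walk B</a></div>
  <div class="itemtitle"><a href="#">Walk C</a></div>
  <h4 class="date">Thursday, December 16, 2021</h4>
  <h3 class="time">10:00 AM-11:00 AM</h3>
  <div class="itemtitle"><a href="#">Walk D</a></div>
</div>
"""


def map_titles(html):
    """Return (date, time, title) tuples, one per title link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for a in soup.select("div.itemtitle > a"):
        # find_previous walks backwards through the document, so each
        # title picks up the closest time and date that precede it.
        slot = a.find_previous("h3").get_text(strip=True)
        date = a.find_previous("h4").get_text(strip=True)
        rows.append((date, slot, a.get_text(strip=True)))
    return rows


rows = map_titles(SAMPLE_HTML)
for date, slot, title in rows:
    print(f"{slot}, {title} ({date})")
```

Because every title looks backwards for its own heading, it does not matter that one time slot has a single title and another has several.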
CodePudding user response:
Try this:
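# Reuses `soup` from the code in the question
content = soup.find('div', {"class": "content"})
times = content.find_all("h3")
output = []
for i,h3 in enumerate(times):
    # Walk the elements that follow this time heading
    for j in h3.next_siblings:
        if j.name:  # skip bare text nodes such as whitespace
            if j.name == "h3":
                break  # the next time heading starts a new group
            j = j.text.replace('\n', '')
            output.append(f"{times[i].text}, {j}")
print(output)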
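The sibling walk above can also be tried offline. The sketch below is only an illustration under assumptions: SAMPLE_HTML is invented markup that imitates the assumed layout, and pair_by_siblings is a hypothetical helper name. Unlike the snippet above, it keeps only div.itemtitle siblings, so a date heading sitting between two time slots does not end up in the result; that filter is my own addition, not something the answer does.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup imitating the assumed layout of the page.
SAMPLE_HTML = """
<div class="content">
  <h4 class="date">Wednesday, December 15, 2021</h4>
  <h3 class="time">10:00 AM-11:00 AM</h3>
  <div class="itemtitle"><a href="#">Walk A</a></div>
  <h3 class="time">5:00 PM-6:00 PM</h3>
  <div class="itemtitle"><a href="#">Walk B</a></div>
  <div class="itemtitle"><a href="#">Walk C</a></div>
  <h4 class="date">Thursday, December 16, 2021</h4>
  <h3 class="time">10:00 AM-11:00 AM</h3>
  <div class="itemtitle"><a href="#">Walk D</a></div>
</div>
"""


def pair_by_siblings(html):
    """Return 'time, title' strings by walking forward from each h3."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="content")
    pairs = []
    for h3 in content.find_all("h3"):
        slot = h3.get_text(strip=True)
        for sib in h3.next_siblings:
            if not isinstance(sib, Tag):
                continue  # whitespace between tags is a text node
            if sib.name == "h3":
                break  # reached the next time slot
            # Keep only title blocks; this skips the h4 date headings.
            if sib.name == "div" and "itemtitle" in sib.get("class", []):
                pairs.append(f"{slot}, {sib.get_text(strip=True)}")
    return pairs


pairs = pair_by_siblings(SAMPLE_HTML)
for line in pairs:
    print(line)
```

The result is a flat list of "time, title" strings in page order, which matches the format asked for in the question.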