Webscrape - Fields of different length

Time:11-07

The current code scrapes individual fields, but I would like to map the time and the titles together.

Since the webpage does not have the time and titles in the same class, how would this mapping occur?

This piggybacks on an earlier question - Link. (My question uses an example where the times and the titles are not of equal length.)

Website for reference: https://ash.confex.com/ash/2021/webprogram/WALKS.html

Sample Expected Output:

5:00 PM-6:00 PM, ASH Poster Walk on Geriatric Hematology: Selecting the Right Treatment for the Patient, Not Just the Disease

5:00 PM-6:00 PM, ASH Poster Walk on Healthcare Quality Improvement

etc

import requests
from bs4 import BeautifulSoup

url = 'https://ash.confex.com/ash/2021/webprogram/WALKS.html'

res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')

productlist = soup.select('div.itemtitle > a')
times = soup.select('.time')

CodePudding user response:

This could be an alternative:

import requests
from bs4 import BeautifulSoup

url = 'https://ash.confex.com/ash/2021/webprogram/WALKS.html'

res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')

productlist = soup.select('div.itemtitle > a')
#times = soup.select('.time')

for a in productlist:
    title = a.text
    time = a.find_previous('h3').text
    date = a.find_previous('h4').text
    print(title, date, time, end = "\n")

OUTPUT

ASH Poster Walk on What's Hot in Sickle Cell Disease 
Wednesday, December 15, 2021
 10:00 AM-11:00 AM

ASH Poster Walk on Geriatric Hematology: Selecting the Right Treatment for the Patient, Not Just the Disease 
Wednesday, December 15, 2021
 5:00 PM-6:00 PM

ASH Poster Walk on Healthcare Quality Improvement 
Wednesday, December 15, 2021
 5:00 PM-6:00 PM

ASH Poster Walk on Natural Killer Cell-Based Immunotherapy 
Wednesday, December 15, 2021
 5:00 PM-6:00 PM

ASH Poster Walk on Pediatric Non-malignant Hematology Highlights 
Wednesday, December 15, 2021
 5:00 PM-6:00 PM

ASH Poster Walk on Clinical Trials In Progress 
Thursday, December 16, 2021
 10:00 AM-11:00 AM

ASH Poster Walk on Financial Toxicity in Hematologic Malignancies 
Thursday, December 16, 2021
 10:00 AM-11:00 AM

ASH Poster Walk on Diversity, Equity, and Inclusion in Hematologic Malignancies and Cell Therapy 
Thursday, December 16, 2021
 5:00 PM-6:00 PM

ASH Poster Walk on Emerging Research in Immunotherapies 
Thursday, December 16, 2021
 5:00 PM-6:00 PM

ASH Poster Walk on the Spectrum of Hemostasis and Thrombosis Research 
Thursday, December 16, 2021
 5:00 PM-6:00 PM
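To get the "time, title" lines from the question's expected output, the same find_previous lookup can be formatted differently. Below is a minimal sketch that runs against a small inline HTML string, so it works offline. The markup and the placeholder titles are assumptions that imitate the layout implied above (an h4 date, an h3 time, then div.itemtitle entries), and pair_times_with_titles is a hypothetical helper name, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the assumed page layout:
# an h4 date, an h3 time slot, then one or more div.itemtitle entries.
SAMPLE_HTML = """
<div class="content">
  <h4>Wednesday, December 15, 2021</h4>
  <h3>10:00 AM-11:00 AM</h3>
  <div class="itemtitle"><a href="#">Walk A</a></div>
  <h3>5:00 PM-6:00 PM</h3>
  <div class="itemtitle"><a href="#">Walk B</a></div>
  <div class="itemtitle"><a href="#">Walk C</a></div>
</div>
"""


def pair_times_with_titles(html):
    """Return 'time, title' strings, one per title link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for link in soup.select("div.itemtitle > a"):
        # The closest h3 before the link is the time slot it belongs to.
        heading = link.find_previous("h3")
        if heading is None:
            continue  # a title with no preceding time heading is skipped
        time_slot = heading.get_text(strip=True)
        title = link.get_text(strip=True)
        rows.append(f"{time_slot}, {title}")
    return rows


rows = pair_times_with_titles(SAMPLE_HTML)
for row in rows:
    print(row)
```

Because every title looks up its own heading, it does not matter that there are fewer time headings than titles. On the live page, res.content would take the place of SAMPLE_HTML.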

CodePudding user response:

Try this:

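import requests
from bs4 import BeautifulSoup

url = 'https://ash.confex.com/ash/2021/webprogram/WALKS.html'

res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')

# Each h3 holds a time slot; walk the elements that follow it.
content = soup.find('div', {"class": "content"})
times = content.find_all("h3")
output = []
for h3 in times:
    for sibling in h3.next_siblings:
        if sibling.name:  # skip bare text nodes, which have no tag name
            if sibling.name == "h3":
                break  # the next time slot starts here
            title = sibling.text.replace('\n', '')
            output.append(f"{h3.text}, {title}")
print(output)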
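If the titles should be grouped by slot instead of printed as flat strings, a single pass over the container's children can carry the latest date and time forward. This is only a sketch under assumptions: it presumes the h4, h3 and div.itemtitle elements are all direct children of div.content, which is not confirmed for the real page. The inline markup, the placeholder titles and the name group_titles_by_slot are hypothetical.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup; assumes headings and titles are siblings
# inside one div.content container.
SAMPLE_HTML = """
<div class="content">
  <h4>Day 1</h4>
  <h3>10:00 AM-11:00 AM</h3>
  <div class="itemtitle"><a href="#">Walk A</a></div>
  <h3>5:00 PM-6:00 PM</h3>
  <div class="itemtitle"><a href="#">Walk B</a></div>
  <div class="itemtitle"><a href="#">Walk C</a></div>
  <h4>Day 2</h4>
  <h3>10:00 AM-11:00 AM</h3>
  <div class="itemtitle"><a href="#">Walk D</a></div>
</div>
"""


def group_titles_by_slot(html):
    """Map (date, time) to the list of titles found under that slot."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="content")
    groups = {}
    if content is None:
        return groups
    current_date = None
    current_time = None
    for child in content.children:
        if not isinstance(child, Tag):
            continue  # ignore whitespace and other text nodes
        if child.name == "h4":
            current_date = child.get_text(strip=True)
        elif child.name == "h3":
            current_time = child.get_text(strip=True)
        elif "itemtitle" in child.get("class", []):
            key = (current_date, current_time)
            groups.setdefault(key, []).append(child.get_text(strip=True))
    return groups


groups = group_titles_by_slot(SAMPLE_HTML)
for (date, time_slot), titles in groups.items():
    print(date, time_slot, titles)
```

Keying on the date as well as the time keeps slots such as 10:00 AM-11:00 AM on different days apart, and date headings are never mistaken for titles.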