Scrape This Field Using BeautifulSoup

Time:11-08

I'm not able to get the current field "type" using BeautifulSoup.

The current code prints a blank value for the "type" variable.

from bs4 import BeautifulSoup
import requests

url='https://ash.confex.com/ash/2021/webprogram/Session20851.html'

res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')

content=soup.find_all('div',class_='paper')

for property in content:
    title=property.find('div',class_='cricon').text
    type=property.find("div",{"id":"info"})

CodePudding user response:

As you can see here, this is what the "property" variable looks like during each iteration over content.

<div >
<div >9:30 AM</div>
<div ><a href="Paper146905.html">7</a></div>
<div >
<div ><a href="Paper146905.html">Sustained Improvements in Patient-Reported Quality of Life up to 24 Months Post-Treatment with LentiGlobin for Sickle Cell Disease (bb1111) Gene Therapy</a></div>
<span >
<p ><b>Mark C. Walters, MD</b><sup>1</sup>, John F. Tisdale, MD<sup>2</sup><sup>*</sup>, Markus Y. Mapara, MD, PhD<sup>3</sup>, Lakshmanan Krishnamurti, MD<sup>4</sup>, Janet L. Kwiatkowski, MD, MSCE<sup>5,6</sup>, Banu Aygun, MD<sup>7</sup>, Kimberly A. Kasow, DO<sup>8</sup><sup>*</sup>, Stacey Rifkin-Zenenberg, DO<sup>9</sup>, Jennifer Jaroscak, MD<sup>10</sup>, Diana Garbinsky, MS<sup>11</sup><sup>*</sup>, Costel Chirila, PhD<sup>11</sup><sup>*</sup>, Meghan E. Gallagher, MSc<sup>12</sup><sup>*</sup>, Xinyan Zhang, PhD<sup>12</sup><sup>*</sup>, Pei-Ran Ho, MD<sup>12</sup><sup>*</sup>, Alexis A. Thompson, MD, MPH<sup>13,14</sup> and Julie Kanter, MD<sup>15</sup></p><p ><sup>1</sup>Division of Hematology, UCSF Benioff Children's Hospital Oakland, Oakland, CA<br/><sup>2</sup>Cellular and Molecular Therapeutics Branch NHLBI/NIDDK, National Institutes of Health, Bethesda, MD<br/><sup>3</sup>Division of Hematology/Oncology, Columbia Center for Translational Immunology, Columbia University Medical Center, New York, NY<br/><sup>4</sup>Aflac Cancer and Blood Disorders Center, Department of Pediatrics, Emory Healthcare, Atlanta, GA<br/><sup>5</sup>Division of Hematology, Children's Hospital of Philadephia, Philadelphia, PA<br/><sup>6</sup>Department of Pediatrics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA<br/><sup>7</sup>Cohen Children’s Medical Center, Queens, NY<br/><sup>8</sup>University of North Carolina, Chapel Hill<br/><sup>9</sup>Hackensack University Medical Center, Hackensack, NJ<br/><sup>10</sup>University Medical Center, Medical University of South Carolina Health, Charleston, SC<br/><sup>11</sup>RTI Health Solutions, Research Triangle Park, NC<br/><sup>12</sup>bluebird bio, Inc., Cambridge, MA<br/><sup>13</sup>Feinberg School of Medicine, Northwestern University, Chicago, IL<br/><sup>14</sup>Ann &amp; Robert H. Lurie Children’s Hospital of Chicago, Chicago, IL<br/><sup>15</sup>University of Alabama Birmingham, Birmingham, AL</p>
</span>
<div ></div>
<div >
</div>
</div>
</div>

In other words, you are iterating over each event, but the type is not part of those blocks. You need to read it once from the header div, the one whose id is "info".
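
The scoping issue can be reproduced offline. The following is a minimal sketch on a made-up HTML fragment (the markup, the placement of the "info" id, and the class names are assumptions modeled on the code in the question, not a copy of the real page). A search started from a "paper" div only sees that div's descendants, so an element in the page header is never found from there.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a header block holding the session type,
# followed by two paper blocks that do not contain it.
SAMPLE_HTML = """
<div id="info"><span class="label">Type:</span> Oral</div>
<div class="paper"><div class="cricon">First title</div></div>
<div class="paper"><div class="cricon">Second title</div></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
papers = soup.find_all("div", class_="paper")

# Searching inside a paper block only looks at its descendants.
scoped = [paper.find("div", {"id": "info"}) for paper in papers]

# Searching from the document root reaches the header block.
header = soup.find("div", {"id": "info"})

print(scoped)             # no match inside either paper block
print(header.get_text())  # text of the header block
```

If the real page is laid out like this, the lookup has to start from `soup` rather than from each item in `content`.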

This should work for you:

from bs4 import BeautifulSoup
import requests

url='https://ash.confex.com/ash/2021/webprogram/Session20851.html'

res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')

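content=soup.find_all('div',class_='paper')
# The session type appears once per page, so look it up on the whole document.
# next_sibling is the text node that follows the "Type:" label span.
type = soup.find("span", string="Type:").next_sibling
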
for property in content:
    title=property.find('div',class_='cricon').text
    print(title, type, sep = "\n", end = "\n\n")

OUTPUT

Sustained Improvements in Patient-Reported Quality of Life up to 24 Months Post-Treatment with LentiGlobin for Sickle Cell Disease (bb1111) Gene Therapy
 Oral

Activation of Pyruvate Kinase-R with Etavopivat (FT-4202) Is Well Tolerated, Improves Anemia, and Decreases Intravascular Hemolysis in Patients with Sickle Cell Disease Treated for up to 12 Weeks
 Oral

Etavopivat, an Allosteric Activator of Pyruvate Kinase-R, Improves Sickle RBC Functional Health and Survival and Reduces Systemic Markers of Inflammation and Hypercoagulability in Patients with Sickle Cell Disease: An Analysis of Exploratory Studies in a Phase 1 Study
 Oral

Mitapivat (AG-348) Demonstrates Safety, Tolerability, and Improvements in Anemia, Hemolysis, Oxygen Affinity, and Hemoglobin S Polymerization Kinetics in Adults with Sickle Cell Disease: A Phase 1 Dose Escalation Study
 Oral

Hydroxyurea Reduces the Transfusion Burden in Children with Sickle Cell Anemia: The Reach Experience
 Oral

Initial Safety and Efficacy Results from the Phase II, Multicenter, Open-Label Solace-Kids Trial of Crizanlizumab in Adolescents with Sickle Cell Disease (SCD)
 Oral
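
The leading space before "Oral" in the output comes from `.next_sibling`, which returns the raw text node that follows the span. A small sketch, again on assumed markup rather than the real page, shows the effect and one way to trim it with `str.strip()`. The names `raw_value` and `session_type` are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a label span followed by a plain text node.
html = '<div><span>Type:</span> Oral</div>'
soup = BeautifulSoup(html, "html.parser")

label = soup.find("span", string="Type:")
raw_value = label.next_sibling          # text node, whitespace included
session_type = str(raw_value).strip()   # plain str without the padding

print(repr(str(raw_value)))
print(repr(session_type))
```

If no span with exactly that text exists, `soup.find` returns `None` and the `.next_sibling` access raises `AttributeError`, so a check for `None` may be worth adding on pages whose layout varies.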

CodePudding user response:

@Void S, you can also do this with a conditional expression that falls back to "oral" when the block has no div with the id "info":

from bs4 import BeautifulSoup
import requests

url = 'https://ash.confex.com/ash/2021/webprogram/Session20851.html'

res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')

content = soup.find_all('div', class_='paper')

for property in content:
    title = property.find('div', class_='cricon').text
    type = property.find("div", {"id": "info"}).text if property.find("div", {"id": "info"}) else "oral"

    print('title:' + str(title), 'type:' + str(type), sep='\n', end='\n\n')

Output:

(scrapyEnv) F:\stackOverflow_answer\stackoverflow-03>python ama.py
title:Sustained Improvements in Patient-Reported Quality of Life up to 24 Months Post-Treatment with LentiGlobin for Sickle Cell Disease (bb1111) Gene Therapy
type:oral

title:Activation of Pyruvate Kinase-R with Etavopivat (FT-4202) Is Well Tolerated, Improves Anemia, and Decreases Intravascular Hemolysis in Patients with Sickle Cell Disease Treated for up to 12 Weeks
type:oral

title:Etavopivat, an Allosteric Activator of Pyruvate Kinase-R, Improves Sickle RBC Functional Health and Survival and Reduces Systemic Markers of Inflammation and Hypercoagulability in Patients with Sickle Cell Disease: An Analysis of Exploratory Studies in a Phase 1 Study
type:oral

title:Mitapivat (AG-348) Demonstrates Safety, Tolerability, and Improvements in Anemia, Hemolysis, Oxygen Affinity, and Hemoglobin S Polymerization Kinetics in Adults with Sickle Cell Disease: A Phase 1 Dose Escalation Study
type:oral

title:Hydroxyurea Reduces the Transfusion Burden in Children with Sickle Cell Anemia: The Reach Experience
type:oral

title:Initial Safety and Efficacy Results from the Phase II, Multicenter, Open-Label Solace-Kids Trial of Crizanlizumab in Adolescents with Sickle Cell Disease (SCD)
type:oral
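
The conditional expression above runs the same `find` twice per block. A possible alternative, sketched here with a hypothetical helper named `text_or_default` (it is not part of BeautifulSoup) and made-up markup, looks the element up once and returns a default when nothing matches.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, name, attrs, default):
    """Return the stripped text of the first match under parent, else default."""
    node = parent.find(name, attrs)
    return node.get_text(strip=True) if node is not None else default


# Hypothetical markup: only the first block carries an element with id "info".
html = """
<div class="paper"><div id="info">Poster</div></div>
<div class="paper"><div class="cricon">No type here</div></div>
"""
soup = BeautifulSoup(html, "html.parser")
papers = soup.find_all("div", class_="paper")

types = [text_or_default(p, "div", {"id": "info"}, "oral") for p in papers]
print(types)
```

Note that a hard-coded default hides the fact that the element was not found, so it only fits pages where the fallback value is known to be correct.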