Webscraping a list of items

Time:12-21

This is my first time programming in Rust (I'm currently reading the book), and I recently needed to scrape a list of diseases and conditions from this site. After trying out several guides, I ended up with this small snippet. I'm currently stuck iterating through an ol: instead of each li being taken as an item in the array, the whole list is being taken as a single element.

use error_chain::error_chain;
use select::document::Document;
use select::predicate::Class;

error_chain! {
      foreign_links {
          ReqError(reqwest::Error);
          IoError(std::io::Error);
      }
}

// Source: https://rust-lang-nursery.github.io/rust-cookbook/web/scraping.html#extract-all-links-from-a-webpage-html
#[tokio::main]
async fn main() -> Result<()> {
    let res = reqwest::get("https://www.cdc.gov/diseasesconditions/az/a.html")
        .await?
        .text()
        .await?;

    Document::from(res.as_str())
        .find(Class("unstyled-list")) // This is returning the whole "ol"
        .for_each(|i| print!("{};", i.text()));

    Ok(())
}

Output. Notice how the whole list is printed as a single item instead of each disease being printed with the expected separator ;:

Abdominal Aortic Aneurysm — see Aortic AneurysmAcanthamoeba InfectionACE (Adverse Childhood Experiences)Acinetobacter InfectionAcquired Immune Deficiency Syndrome (AIDS) — see HIVAcute Flaccid Myelitis (AFM)Adenovirus InfectionAdenovirus VaccinationADHD [Attention Deficit/Hyperactivity Disorder]Adult VaccinationsAdverse Childhood Experiences (ACE)AFib, AF (Atrial fibrillation)AFMAfrican Trypanosomiasis — see Sleeping SicknessAgricultural Safety — see Farm Worker InjuriesAHF (Alkhurma hemorrhagic fever)AIDS (Acquired Immune Deficiency Syndrome)Alkhurma hemorrhagic fever (AHF)ALS [Amyotrophic Lateral Sclerosis]Alzheimer's DiseaseAmebiasis, Intestinal [Entamoeba histolytica infection]American Trypanosomiasis — see Chagas DiseaseAmphibians and Fish, Infections from — see Fish and Amphibians, Infections fromAmyotrophic Lateral Sclerosis — see ALSAnaplasmosis, HumanAncylostoma duodenale Infection, Necator americanus Infection — see Human HookwormAngiostrongylus InfectionAnimal-Related DiseasesAnisakiasis — see Anisakis InfectionAnisakis Infection [Anisakiasis]Anthrax VaccinationAnthrax [Bacillus anthracis Infection]Antibiotic-resistant Infections - ListingAntibiotic and Antimicrobial ResistanceAntibiotic Use, Appropriatesee also U.S. Antibiotic Awareness Week (USAAW)Aortic AneurysmAortic Dissection — see Aortic AneurysmArenavirus InfectionsArthritisChildhood ArthritisFibromyalgiaGoutOsteoarthritis (OA)Rheumatoid Arthritis (RA)Ascariasis — see Ascaris InfectionAscaris Infection [Ascariasis]Aseptic Meningitis — see Viral MeningitisAspergillosis — see Aspergillus InfectionAspergillus Infection [Aspergillosis]AsthmaAtrial fibrillation (AFib, AF)Attention Deficit/Hyperactivity Disorder — see ADHDAutismsee also Genetics and GenomicsAvian Influenza  ;

The expected output would instead be:

Abdominal Aortic Aneurysm — see Aortic Aneurysm;Acanthamoeba Infection;ACE (Adverse Childhood Experiences);Acinetobacter Infection; etc...

CodePudding user response:

find() returns an iterator over the elements matching the predicate, and here the only match is the <ol> itself. Take that node and call .children() to get the <li>s:

    Document::from(res.as_str())
        .find(Class("unstyled-list"))
        .next() // Get the first match
        .expect("no matching <ol>")
        .children()
        .for_each(|i| print!("{};", i.text()));
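
One thing to watch for: .children() walks every child node of the <ol>, which may include whitespace-only text nodes sitting between the <li> tags, and print!("{};", ...) always leaves a trailing separator. Below is a minimal, std-only sketch of one way to tidy the collected strings. The helper name join_items and the sample input are hypothetical and not part of the select crate or the original code.

```rust
/// Trim each item, drop the ones that end up empty, and join the rest with `sep`.
/// (Hypothetical helper for illustration; it is not part of the `select` crate.)
fn join_items<I, S>(items: I, sep: &str) -> String
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    items
        .into_iter()
        .map(|s| s.as_ref().trim().to_string())
        .filter(|s| !s.is_empty())
        .collect::<Vec<_>>()
        .join(sep)
}

fn main() {
    // Made-up input standing in for the text of each child node; the blank
    // entries mimic whitespace-only text nodes between <li> elements.
    let raw = vec![
        "\n  ",
        "Acanthamoeba Infection",
        "\n  ",
        "Adenovirus Infection ",
        "\n",
    ];
    println!("{}", join_items(raw, ";"));
}
```

With the snippet above, something like join_items(node.children().map(|n| n.text()), ";") should work, assuming node is the matched <ol>.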