This is my first time programming in Rust (I'm currently reading the book), and I recently needed to scrape a list of diseases and conditions from this site. After trying out several guides, I ended up with this small snippet. I'm currently stuck iterating through an <ol>: instead of each <li> being taken as an item in the array, the whole list is being taken as a single element.
use error_chain::error_chain;
use select::document::Document;
use select::predicate::Class;
error_chain! {
    foreign_links {
        ReqError(reqwest::Error);
        IoError(std::io::Error);
    }
}
// Source: https://rust-lang-nursery.github.io/rust-cookbook/web/scraping.html#extract-all-links-from-a-webpage-html
#[tokio::main]
async fn main() -> Result<()> {
    let res = reqwest::get("https://www.cdc.gov/diseasesconditions/az/a.html")
        .await?
        .text()
        .await?;
    Document::from(res.as_str())
        .find(Class("unstyled-list")) // This is returning the whole <ol>
        .for_each(|i| print!("{};", i.text()));
    Ok(())
}
Output (notice how the whole list is printed as a single item, instead of each disease being printed with the expected separator ;):
Abdominal Aortic Aneurysm — see Aortic AneurysmAcanthamoeba InfectionACE (Adverse Childhood Experiences)Acinetobacter InfectionAcquired Immune Deficiency Syndrome (AIDS) — see HIVAcute Flaccid Myelitis (AFM)Adenovirus InfectionAdenovirus VaccinationADHD [Attention Deficit/Hyperactivity Disorder]Adult VaccinationsAdverse Childhood Experiences (ACE)AFib, AF (Atrial fibrillation)AFMAfrican Trypanosomiasis — see Sleeping SicknessAgricultural Safety — see Farm Worker InjuriesAHF (Alkhurma hemorrhagic fever)AIDS (Acquired Immune Deficiency Syndrome)Alkhurma hemorrhagic fever (AHF)ALS [Amyotrophic Lateral Sclerosis]Alzheimer's DiseaseAmebiasis, Intestinal [Entamoeba histolytica infection]American Trypanosomiasis — see Chagas DiseaseAmphibians and Fish, Infections from — see Fish and Amphibians, Infections fromAmyotrophic Lateral Sclerosis — see ALSAnaplasmosis, HumanAncylostoma duodenale Infection, Necator americanus Infection — see Human HookwormAngiostrongylus InfectionAnimal-Related DiseasesAnisakiasis — see Anisakis InfectionAnisakis Infection [Anisakiasis]Anthrax VaccinationAnthrax [Bacillus anthracis Infection]Antibiotic-resistant Infections - ListingAntibiotic and Antimicrobial ResistanceAntibiotic Use, Appropriatesee also U.S. Antibiotic Awareness Week (USAAW)Aortic AneurysmAortic Dissection — see Aortic AneurysmArenavirus InfectionsArthritisChildhood ArthritisFibromyalgiaGoutOsteoarthritis (OA)Rheumatoid Arthritis (RA)Ascariasis — see Ascaris InfectionAscaris Infection [Ascariasis]Aseptic Meningitis — see Viral MeningitisAspergillosis — see Aspergillus InfectionAspergillus Infection [Aspergillosis]AsthmaAtrial fibrillation (AFib, AF)Attention Deficit/Hyperactivity Disorder — see ADHDAutismsee also Genetics and GenomicsAvian Influenza ;
The expected output would instead be:
Abdominal Aortic Aneurysm — see Aortic Aneurysm;Acanthamoeba Infection;ACE (Adverse Childhood Experiences);Acinetobacter Infection; etc...
CodePudding user response:
find() returns an iterator over the elements matching the criteria, which here is the <ol> itself. You need to call .children() on it to get the <li>s:
Document::from(res.as_str())
    .find(Class("unstyled-list"))
    .next() // Get the first match
    .expect("no matching <ol>")
    .children()
    .for_each(|i| print!("{};", i.text()));
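One thing to watch for: .children() yields every child node, so if the markup has whitespace between the <li> tags you may also get blank text nodes, and print!("{};", ...) always leaves a trailing separator. A minimal sketch of one way to tidy that up, using only the standard library, is below. Note that join_items is a hypothetical helper name of my own, not part of the select crate, and it assumes you have already collected each child's text as a String.

```rust
/// Hypothetical helper (not part of the `select` crate): trims each item,
/// drops the ones that are empty after trimming, and joins the rest with `;`.
fn join_items<I>(items: I) -> String
where
    I: IntoIterator<Item = String>,
{
    items
        .into_iter()
        // Remove leading/trailing whitespace such as newlines and indentation.
        .map(|s| s.trim().to_string())
        // Skip entries that came from whitespace-only text nodes.
        .filter(|s| !s.is_empty())
        .collect::<Vec<_>>()
        // `join` puts the separator between items only, so none trails at the end.
        .join(";")
}
```

With the snippet above, this could be fed from the child nodes, for example by replacing the final .for_each(...) with .map(|i| i.text()) and passing the resulting iterator to join_items before printing the returned string.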