I am scraping datasets, but some files are mislabeled and are throwing off the dependent code. What I am trying to do now is filter for relevancy before passing the strings on for further analysis.
All the data comes from USDA. Here are some sample strings:
2022_ADMLivestockLrp_Daily_20220617.zip.faerkdb3.jpj
2022_ADMLivestockLrp_Daily_20220618.zip
What I want is to detect which strings DO NOT have characters AFTER the ".zip". I have been trying to use grepl and stringr with a ".zip*" wildcard but cannot figure it out. I am not trying to delete the characters, just to detect whether they exist or not. Any help is appreciated.
Here is what I have tried:
library(rvest)
url <- 'https://ftp.rma.usda.gov/pub/references/adm_livestock/2022/'
href <- read_html(url)
href_names <- as.list(html_attr(html_nodes(href, "a"), "href"))
# Attempted filter: as a regex, ".zip*" means any character, then "zi", then zero or more "p"
href_zip <- href_names[grepl(".zip*", href_names)]
CodePudding user response:
grep("[.]zip$", href_names, value = TRUE)
[1] "/pub/references/adm_livestock/2022/2022_A00831_ADMDrpDraw_Quarterly_20210701.zip"
[2] "/pub/references/adm_livestock/2022/2022_A00831_ADMDrpDraw_Quarterly_20210723.zip"
[3] "/pub/references/adm_livestock/2022/2022_A00831_ADMDrpDraw_Quarterly_20211021.zip"
[4] "/pub/references/adm_livestock/2022/2022_A00831_ADMDrpDraw_Quarterly_20220125.zip"
[5] "/pub/references/adm_livestock/2022/2022_A00831_ADMDrpDraw_Quarterly_20220421.zip"
[6] "/pub/references/adm_livestock/2022/2022_A00832_ADMDrpMilkYield_Quarterly_20210630.zip"
CodePudding user response:
Try this; it matches ".zip" only if at least one character follows it:
https://regex101.com/r/xcjWu9/1
This one matches only if it finds ".zip" with NO characters following it.
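For reference, here is a minimal base R sketch of both checks, run on the two sample strings from the question. The patterns below are my own assumptions and are not necessarily the ones behind the regex101 link. The original ".zip*" attempt matches both strings because, in a regular expression, "." stands for any character and "p*" means zero or more "p"; escaping the dot and anchoring with "$" avoids that.

```r
# Sample strings taken from the question
files <- c(
  "2022_ADMLivestockLrp_Daily_20220617.zip.faerkdb3.jpj",
  "2022_ADMLivestockLrp_Daily_20220618.zip"
)

# TRUE when at least one character follows a literal ".zip"
has_trailing <- grepl("[.]zip.+$", files)

# TRUE when the string ends exactly at a literal ".zip"
ends_in_zip <- grepl("[.]zip$", files)

# The same check without a regular expression (base R, no packages)
ends_in_zip_fixed <- endsWith(files, ".zip")

# Keep only the correctly labeled files
clean_files <- files[ends_in_zip]
print(clean_files)
```

If href_names is a list, as in the question, wrapping it in unlist() first keeps the result a plain character vector.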