Detect strings that have characters after specific character

Time:08-04

I am scraping datasets, but some files are mislabeled and are throwing off the dependent code. What I am trying to do now is filter for relevancy before passing the strings on for further analysis.

All the data comes from the USDA. Here are some sample strings:

2022_ADMLivestockLrp_Daily_20220617.zip.faerkdb3.jpj 
2022_ADMLivestockLrp_Daily_20220618.zip

What I want is to detect which strings DO NOT have characters AFTER the ".zip". I have been trying to use grepl and stringr with a ".zip*" wildcard but cannot figure it out. I am not trying to delete the characters, just to detect whether they exist or not. Any help is appreciated.

Here is what I have tried:

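  library(rvest)  # read_html(), html_nodes(), html_attr()
  url <- 'https://ftp.rma.usda.gov/pub/references/adm_livestock/2022/'
  href <- read_html(url)
  href_names <- as.list(html_attr(html_nodes(href, "a"), "href"))
  href_zip <- href_names[grepl(".zip*", href_names)]  # the attempt: also keeps the mislabeled files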

CodePudding user response:

grep("[.]zip$", href_names, value = TRUE)

[1] "/pub/references/adm_livestock/2022/2022_A00831_ADMDrpDraw_Quarterly_20210701.zip"     
[2] "/pub/references/adm_livestock/2022/2022_A00831_ADMDrpDraw_Quarterly_20210723.zip"     
[3] "/pub/references/adm_livestock/2022/2022_A00831_ADMDrpDraw_Quarterly_20211021.zip"     
[4] "/pub/references/adm_livestock/2022/2022_A00831_ADMDrpDraw_Quarterly_20220125.zip"     
[5] "/pub/references/adm_livestock/2022/2022_A00831_ADMDrpDraw_Quarterly_20220421.zip"     
[6] "/pub/references/adm_livestock/2022/2022_A00832_ADMDrpMilkYield_Quarterly_20210630.zip"
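As a small sketch of why the anchored pattern works where ".zip*" did not, the two sample strings from the question can be tested directly (the vector name `files` is made up for illustration). In a regular expression, an unescaped `.` matches any character and `*` applies only to the preceding `p`, so ".zip*" matches any string containing some character followed by "zi". Writing the dot as `[.]` makes it literal, and `$` requires the string to end right after ".zip".

```r
# Sample strings from the question; `files` is a hypothetical name.
files <- c(
  "2022_ADMLivestockLrp_Daily_20220617.zip.faerkdb3.jpj",
  "2022_ADMLivestockLrp_Daily_20220618.zip"
)

# Original attempt: "." is any character and "*" only quantifies "p",
# so both strings match.
loose <- grepl(".zip*", files)

# Literal dot plus end-of-string anchor: only the clean name matches.
clean <- grepl("[.]zip$", files)

print(loose)
print(clean)
```

Because `grepl` returns a logical vector, it only detects the trailing characters and leaves the strings themselves unchanged.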

CodePudding user response:

Try this; it will match ".zip" only if at least one character follows it.

https://regex101.com/r/xcjWu9/1

This one will match only if it finds ".zip" and NOT any characters following it.

https://regex101.com/r/dcXCUd/1
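The linked patterns are not reproduced here, so the following is only a rough R sketch of the two cases just described, not the exact expressions behind the links. It assumes base R only; the vector `files` and the entry "readme.txt" are hypothetical, added to show what happens when ".zip" is absent altogether.

```r
# First two strings are from the question; "readme.txt" is a made-up
# example of a name that contains no ".zip" at all.
files <- c(
  "2022_ADMLivestockLrp_Daily_20220617.zip.faerkdb3.jpj",
  "2022_ADMLivestockLrp_Daily_20220618.zip",
  "readme.txt"
)

# Case 1: ".zip" followed by at least one more character.
has_extra <- grepl("[.]zip.+", files)

# Case 2: ".zip" with nothing after it. endsWith() is a base R
# alternative to the regex "[.]zip$" that needs no escaping.
ends_zip <- endsWith(files, ".zip")

print(data.frame(files, has_extra, ends_zip))
```

Note that a name without ".zip" is FALSE in both columns, so negating one test is not the same as running the other.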
