How to extract a specific numeric string from HTML code?

Time:10-21

I am downloading some HTML into R:

library(RCurl)
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)

html4 <- getURL('http://website/Busqueda_persona.aspx', curl = curl)
viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html4))
viewstategenerator <- as.character(sub('.*id="__VIEWSTATEGENERATOR" value="([0-9a-zA-Z+/=]*).*', '\\1', html4))
eventvalidation <- as.character(sub('.*id="__EVENTVALIDATION" value="([0-9a-zA-Z+/=]*).*', '\\1', html4))

params <- list(
  '__VIEWSTATE' = viewstate,
  '__VIEWSTATEGENERATOR' = viewstategenerator,
  '__EVENTVALIDATION' = eventvalidation,
  'ctl00$cphMain$ddlTipoIdentificacion' =  "296" ,
  'ctl00$cphMain$txtNumeroIdentificacion'  = "1109927000",
  'ctl00$cphMain$ddlTipoIdentificacionPersonaACargo' = "0",
  'ctl00$cphMain$btnBuscar' = "Buscar"
  )

html5 = postForm('http://website/Busqueda_persona.aspx', .params = params, curl = curl)

Part of the resulting html5 includes this:

onclick='javascript:Direccionar(1682000,296,"1109927000",1);'

I need to extract the 1682000 and store it in a separate object.

EDIT1: After trying @akrun's advice, I get this:

sub("\\D+(\\d+).*", "\\1", html5)
[1] "3"
attr(,"Content-Type")
                charset 
"text/html"     "utf-8"

I have uploaded the entire html5 R object here: https://controlc.com/b9d622ff

Answer:

We may use str_extract_all from stringr:

library(stringr)
str_extract_all(str2, "(?<=onclick\\='javascript\\:Direccionar\\()\\d+")[[1]]
[1] "1682000" "1682000"

Or use it in combination with parse_number to get a numeric value:

readr::parse_number(str_extract_all(str1, "onclick='javascript\\:Direccionar\\([0-9]+")[[1]])
[1] 1682000
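
If installing stringr and readr is not an option, the same extraction can be sketched in base R. The snippet below is a minimal, self-contained illustration and not part of the original answer: html_sample is a made-up string that wraps the onclick fragment from the question in hypothetical <a> markup, and id_chr and id_num are illustrative names.

```r
# Sample input: the onclick fragment quoted in the question, embedded in
# made-up surrounding markup (the <a> tags are an assumption for illustration).
html_sample <- "<a href='#' onclick='javascript:Direccionar(1682000,296,\"1109927000\",1);'>Ver</a>"

# Match only the digits that directly follow "Direccionar(".
# perl = TRUE is needed for the lookbehind (?<=...).
m <- regexpr("(?<=Direccionar\\()[0-9]+", html_sample, perl = TRUE)

# regmatches() returns the matched text as a character vector.
id_chr <- regmatches(html_sample, m)

# Convert to a number if a numeric value is needed.
id_num <- as.numeric(id_chr)

print(id_chr)
print(id_num)
```

Replacing html_sample with html5 should work the same way, assuming html5 is the character vector returned by postForm. Note that regexpr only reports the first match.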

There are two instances of the substring:

> substr(str2, 51380, 51418)
[1] "onclick='javascript:Direccionar(1682000"
> substr(str2, 51536, 51574)
[1] "onclick='javascript:Direccionar(1682000"

They were found with str_locate_all:

> str_locate_all(str2, "(?<=onclick='javascript:Direccionar\\()[0-9]+")
[[1]]
     start   end
[1,] 51412 51418
[2,] 51568 51574
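
Because the value appears twice, it may help to collapse the matches to a single value before storing it. The following is a rough base R sketch of that idea, using gregexpr in place of str_locate_all. The input html_sample is invented for illustration (two hypothetical <a> elements carrying the same onclick fragment), so the positions it reports differ from the ones shown above.

```r
# Made-up input with the same onclick fragment appearing twice,
# mimicking the two hits reported above (the <a> markup is an assumption).
html_sample <- paste0(
  "<a onclick='javascript:Direccionar(1682000,296,\"1109927000\",1);'>A</a>",
  "<a onclick='javascript:Direccionar(1682000,296,\"1109927000\",1);'>B</a>"
)

pattern <- "(?<=onclick='javascript:Direccionar\\()[0-9]+"

# gregexpr() finds every match; the result holds start positions and lengths.
g <- gregexpr(pattern, html_sample, perl = TRUE)
m <- g[[1]]

# Build a start/end table similar to what str_locate_all() prints.
positions <- data.frame(
  start = as.integer(m),
  end   = as.integer(m) + attr(m, "match.length") - 1L
)

# Extract every match, then drop duplicates so one value remains.
all_ids   <- regmatches(html_sample, g)[[1]]
unique_id <- unique(all_ids)

print(positions)
print(unique_id)
```

If the page could ever contain different identifiers, unique_id would have more than one element, so checking length(unique_id) before using it is a sensible guard.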