Web scraping with R - drop down menu

Time:12-28

I'm trying to scrape data from this address:

http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino?fbclid=IwAR3_hZwajHk_iyU085S1LDTqLCOYLHIZ5K825XgPGcB4tMI0EuCJpQNrJHM#

There are two drop-down menus ("Origem" and "Destino"). I need to build a dataset covering every possible combination of "Origem" and "Destino".

Below is part of my R code. I can't work out how to select an option within a drop-down menu, which I need in order to loop over the options and extract the data.

Any suggestions?

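library(RSelenium)

# Start a Selenium server and a browser session.
# The browser and port below are assumptions; adjust them to your setup.
rs_driver_object <- rsDriver(browser = "firefox", port = 4545L)
remDr <- rs_driver_object$client  # the client returned by rsDriver() is already open

remDr$navigate("http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino?fbclid=IwAR3_hZwajHk_iyU085S1LDTqLCOYLHIZ5K825XgPGcB4tMI0EuCJpQNrJHM#")

# Locate the two drop-down menus and the search button by their element ids
Origem <- remDr$findElement(using = 'id', 'Origem')
Destino <- remDr$findElement(using = 'id', 'Destino')
botão_pesquisar <- remDr$findElement(using = 'id', 'btnPesquisar')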


Answer:

Grab the values (the location IDs) from each combo box and keep them in two arrays (origin and destination), along with their labels. The page calls an endpoint with those IDs posted as parameters; the call looks like this:

library(RCurl)

# Headers copied from the browser's network tab. The cookie value is tied to
# one browser session, so copy a fresh one from your own session.
headers <- c(
  "Accept" = "application/json, text/javascript, */*; q=0.01",
  "Accept-Language" = "en-US,en;q=0.9",
  "Connection" = "keep-alive",
  "Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8",
  "Cookie" = "__RequestVerificationToken_L1RyYW5zcG9ydGVDb2xldGl2bw2=tY-yKlWmbZvAJzMHmITkohPiIos5XkjDBwf1ZBfP_bYWdXJMBF2Qw3z_B-LRVo0kXjdnHqDqsbZ04Zij_PM-wAf4DWVKfnQskOhqo4ANSRc1",
  "Origin" = "http://extranet.artesp.sp.gov.br",
  "Referer" = "http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino?fbclid=IwAR3_hZwajHk_iyU085S1LDTqLCOYLHIZ5K825XgPGcB4tMI0EuCJpQNrJHM",
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
  "X-Requested-With" = "XMLHttpRequest"
)

# Form body: origin ID, destination ID, and the verification token from the page
params <- "origem=387&destino=388&__RequestVerificationToken=Z-wXmGOb9pnQbmkfcQXmChT-6uc3YfGjftHwK4HnC9SDCaKmzIafo7AI3lChBY6YDBHdpT_X98mSHGAr_YrTNgKiepKxKraGu7p6PI7dV4g1"

# POST to the endpoint the page itself calls when you press the search button
res <- postForm(
  "http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino/GetGrid",
  .opts = list(postfields = params, httpheader = headers, followlocation = TRUE),
  style = "httppost"
)
cat(res)

See the origem= and destino= parameters? Those are the values from the static combo box fields. It would be easy to do this whole thing with plain web requests, without driving a browser. The response for each call looks like this:

[
    {
        "Codigo": 0,
        "Empresa": {
            "Codigo": 447,
            "Descricao": "VIAÇÃO VALE DO TIETE LTDA",
            "FlagCNPJ": false,
            "CNPJ": null,
            "CPF": null,
            "Fretamento": null,
            "Escolar": null,
            "Municipio": null,
            "UF": null,
            "Endereco": null,
            "Bairro": null,
            "CEP": null,
            "Telefone": null,
            "Email": null
        },
        "CodigoMunicipioOrigem": 387,
        "CodigoMunicipioDestino": 388
    }
]

So when a trip is found, you get back an array of entries. I'm not sure exactly what each entry represents, but I assume they are ticket entries. The array comes back empty when the origin and destination have no schedules.
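
To loop over every combination, something along these lines might work. It is only a sketch: the helper names (build_pairs, build_params, parse_grid, scrape_all) are made up for illustration, the form field names are copied from the request above, and it assumes the response always has the shape shown in the sample JSON. The IDs are kept as character strings, as they would be when read from the option value attributes, and jsonlite is assumed to be installed for the parsing step.

```r
# Hypothetical helpers; none of these names come from the site or from a package.

# Every origin/destination combination, one row per pair.
build_pairs <- function(origins, destinations) {
  expand.grid(origem = as.character(origins),
              destino = as.character(destinations),
              KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
}

# Form body in the same layout as the params string in the request above.
build_params <- function(origem, destino, token) {
  sprintf("origem=%s&destino=%s&__RequestVerificationToken=%s",
          origem, destino, utils::URLencode(token, reserved = TRUE))
}

# Turn one JSON response into a data frame; NULL when the array is empty.
parse_grid <- function(json_text, origem, destino) {
  rows <- jsonlite::fromJSON(json_text, simplifyVector = FALSE)
  if (length(rows) == 0) return(NULL)
  or_na <- function(x) if (is.null(x)) NA else x
  data.frame(
    origem = origem,
    destino = destino,
    empresa_codigo = vapply(rows, function(r) as.integer(or_na(r$Empresa$Codigo)), integer(1)),
    empresa = vapply(rows, function(r) as.character(or_na(r$Empresa$Descricao)), character(1)),
    stringsAsFactors = FALSE
  )
}

# Post once per pair, reusing the postForm() call from above.
# Not run here: it needs a valid token and cookie from a live session.
scrape_all <- function(origins, destinations, token, headers,
                       url = "http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino/GetGrid") {
  pairs <- build_pairs(origins, destinations)
  results <- lapply(seq_len(nrow(pairs)), function(i) {
    body <- build_params(pairs$origem[i], pairs$destino[i], token)
    res <- RCurl::postForm(url,
                           .opts = list(postfields = body, httpheader = headers,
                                        followlocation = TRUE),
                           style = "httppost")
    Sys.sleep(0.5)  # pause between calls to go easy on the server
    parse_grid(as.character(res), pairs$origem[i], pairs$destino[i])
  })
  results <- Filter(Negate(is.null), results)  # drop pairs with no schedules
  if (length(results) == 0) NULL else do.call(rbind, results)
}
```

You would call it as scrape_all(origin_ids, destination_ids, token, headers), where origin_ids and destination_ids are the option values read from the two combo boxes and token and headers come from your own browser session. Pairs that return an empty array simply do not appear in the result.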
