I'm having some problems scraping data from a website, and I don't have much experience with web scraping. My plan is to scrape some data using R from the following website: https://www.fatf-gafi.org/countries/
More precisely, I want to extract the list of countries that are subject to some sort of sanctions. This is what I have tried so far:
library(XML)
url <- paste0("https://www.fatf-gafi.org/countries/")
source <- readLines(url, encoding = "UTF-8")
parsed_doc <- htmlParse(source, encoding = "UTF-8")
But this doesn't bring up the intended information, because it is not in a table but in a nested div.
CodePudding user response:
This is a tricky parsing job. The information you need is not in the HTML you are getting from readLines. Instead, it is loaded dynamically by the page using an XHR request. Often, an XHR request like this will return a JSON string, but in your case it returns JavaScript in which the information is stored as a variable containing an array of JSON snippets, one for each country. This can be accessed through some string manipulation and JSON parsing to get your end result:
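library(httr)
# The country data is served from a separate JavaScript file, not the HTML page
url <- paste0('https://www.fatf-gafi.org/media/fatf/fatfv20/',
              'js/country-data-multi-lang.js')
js <- content(GET(url), 'text')
# Keep only the text that follows the `countries` variable declaration
vars <- strsplit(js, 'var countries = ')[[1]][2]
# Split the array on the object boundaries and wrap each piece in braces again
vars <- paste0("{", sub("^\\[\\{", "", strsplit(vars, '\\},\\{')[[1]]), "}")
# Parse the first 209 snippets (one per country) and bind them into a data frame
countries <- do.call(rbind, lapply(vars[1:209],
  function(x) as.data.frame(jsonlite::parse_json(x))))
# Keep columns 1 and 4 to 13, then strip everything up to the last dot in each name
countries <- countries[c(1, 4:13)]
names(countries) <- sub('^.*\\.', '', names(countries))
dplyr::tibble(countries)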
#> # A tibble: 209 x 11
#> name FATF APG CFATF EAG ESAAMLG GABAC GAFILAT GIABA MENAFATF MONEYVAL
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Afghani~ "" "mbr" "" "obs" "" "" "" "" "" ""
#> 2 Albania "" "" "" "" "" "" "" "" "" "mbr"
#> 3 Algeria "" "" "" "" "" "" "" "" "mbr" ""
#> 4 Andorra "" "" "" "" "" "" "" "" "" "mbr"
#> 5 Angola "" "" "" "" "mbr" "" "" "" "" ""
#> 6 Anguilla "" "" "mbr" "" "" "" "" "" "" ""
#> 7 Antigua~ "" "" "mbr" "" "" "" "" "" "" ""
#> 8 Argenti~ "mbr" "non" "non" "non" "non" "" "mbr" "non" "non" "non"
#> 9 Armenia "" "" "" "obs" "" "" "" "" "" "mbr"
#> 10 Aruba K~ "els" "" "mbr" "" "" "" "" "" "" ""
#> # ... with 199 more rows
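If the hard-coded 1:209 and the splitting on '},{' feel fragile, the same idea can be sketched by cutting out the whole array and handing it to jsonlite in one go. This is only a sketch under assumptions: the js string below is a made-up stand-in for the downloaded file, and it assumes the array is valid JSON that ends at the first "];" after the declaration, which may not hold for the real script.

```r
library(jsonlite)

# Hypothetical stand-in for the downloaded script (not the real file contents)
js <- paste0(
  'var lang = "en"; var countries = [',
  '{"name":"Albania","code":"AL","groups":{"FATF":"","MONEYVAL":"mbr"}},',
  '{"name":"Argentina","code":"AR","groups":{"FATF":"mbr","MONEYVAL":"non"}}',
  ']; var other = 1;'
)

# Locate the declaration and keep the text that follows it
start <- regexpr("var countries = ", js, fixed = TRUE)
rest <- substring(js, start + attr(start, "match.length"))

# Cut at the first "];" so that only the array literal remains
array_txt <- sub("\\];.*$", "]", rest)

# Parse the array; flatten = TRUE expands the nested `groups` object into columns
countries <- fromJSON(array_txt, flatten = TRUE)

# Drop the "groups." prefix that flattening adds to the nested columns
names(countries) <- sub("^groups\\.", "", names(countries))
countries
```

With the real file, js would come from content(GET(url), 'text') as in the answer above, and the cut-off pattern may need adjusting to however the array is actually terminated.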
CodePudding user response:
This is mostly a test of how JavaScript evaluation works with V8, the embedded JavaScript and WebAssembly engine for R:
https://cran.r-project.org/web/packages/V8/vignettes/v8_intro.html
Create a context engine, evaluate the requested JavaScript, and get the value of the countries variable from V8 (it is turned into a nested data frame, thus the unnest()); the last row is filled with NAs, thus the filter.
library(httr)
library(V8)
library(dplyr)
library(tidyr)
url <- paste0('https://www.fatf-gafi.org/media/fatf/fatfv20/',
'js/country-data-multi-lang.js')
js_content <- content(GET(url), 'text')
ct <- v8()
ct$eval(js_content)
ct$get("countries") %>%
unnest(cols = c(groups)) %>%
select(c(1:2,4:14,16)) %>%
filter(!is.na(name))
#> # A tibble: 209 × 14
#> name code FATF APG CFATF EAG ESAAMLG GABAC GAFILAT GIABA MENAFATF
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Afghanist… AF "" "mbr" "" "obs" "" "" "" "" ""
#> 2 Albania AL "" "" "" "" "" "" "" "" ""
#> 3 Algeria DZ "" "" "" "" "" "" "" "" "mbr"
#> 4 Andorra AD "" "" "" "" "" "" "" "" ""
#> 5 Angola AO "" "" "" "" "mbr" "" "" "" ""
#> 6 Anguilla AI "" "" "mbr" "" "" "" "" "" ""
#> 7 Antigua a… AG "" "" "mbr" "" "" "" "" "" ""
#> 8 Argentina AR "mbr" "non" "non" "non" "non" "" "mbr" "non" "non"
#> 9 Armenia AM "" "" "" "obs" "" "" "" "" ""
#> 10 Aruba Kin… AW "els" "" "mbr" "" "" "" "" "" ""
#> # … with 200 more rows, and 3 more variables: MONEYVAL <chr>,
#> # jurisdiction <chr>, id <chr>
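Once the table is in R, narrowing it down to the countries of interest is an ordinary row filter. The sketch below uses a small hand-made data frame in place of the scraped one, and both the column (FATF) and the code ("mbr") are placeholders: the output above only shows status codes such as "mbr", "obs", "non" and "els" per body, so which column and code match the sanctions the question asks about is an assumption to check against the site.

```r
# Hand-made stand-in for the scraped table (values copied from the printed rows)
countries <- data.frame(
  name = c("Afghanistan", "Albania", "Argentina"),
  FATF = c("", "", "mbr"),
  MONEYVAL = c("", "mbr", "non"),
  stringsAsFactors = FALSE
)

# Placeholder choice of column and status code; swap in the ones you need
status_column <- "FATF"
status_code <- "mbr"

# Keep the names of the rows whose status in that column equals the code
selected <- countries$name[countries[[status_column]] == status_code]
selected
```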