Web-Scraping using R - I want to extract some table like data from a website

Time:06-03

I'm having some problems scraping data from a website, and I don't have much experience with web scraping. My plan is to use R to scrape some data from the following website: https://www.fatf-gafi.org/countries/

More precisely, I want to extract the list of countries that are subject to some sort of sanctions.

library(XML)

# Download the raw page source and parse it as HTML
url <- "https://www.fatf-gafi.org/countries/"
source <- readLines(url, encoding = "UTF-8")
parsed_doc <- htmlParse(source, encoding = "UTF-8")

But this doesn't return the information I'm after, because it is not stored in a table; it sits inside nested div elements.

CodePudding user response:

This is a tricky parsing job. The information you need is not in the HTML you get from readLines. Instead, the page loads it dynamically with an XHR request. Often an XHR request like this returns a JSON string, but in your case it returns JavaScript in which the information is stored as a variable containing an array of JSON snippets, one for each country. You can get to your end result with some string manipulation and JSON parsing:

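library(httr)

# The country data lives in a separate JavaScript file, not in the page HTML
url <- paste0('https://www.fatf-gafi.org/media/fatf/fatfv20/',
              'js/country-data-multi-lang.js')
js <- content(GET(url), 'text')

# Keep the text after the assignment, then split it into one JSON object per country
vars <- strsplit(js, 'var countries = ')[[1]][2]
vars <- paste0("{", sub("^\\[\\{", "", strsplit(vars, '\\},\\{')[[1]]), "}")

# Parse each of the 209 country snippets and stack them into one data frame
countries <- do.call(rbind, lapply(vars[1:209], 
                      function(x) as.data.frame(jsonlite::parse_json(x))))

# Keep the name and the membership columns; drop the prefix before the dot
countries <- countries[c(1, 4:13)]
names(countries) <- sub('^.*\\.', '', names(countries))

dplyr::tibble(countries)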
#> # A tibble: 209 x 11
#>   name     FATF  APG   CFATF EAG   ESAAMLG GABAC GAFILAT GIABA MENAFATF MONEYVAL
#>   <chr>    <chr> <chr> <chr> <chr> <chr>   <chr> <chr>   <chr> <chr>    <chr>   
#> 1 Afghani~ ""    "mbr" ""    "obs" ""      ""    ""      ""    ""       ""      
#> 2 Albania  ""    ""    ""    ""    ""      ""    ""      ""    ""       "mbr"   
#> 3 Algeria  ""    ""    ""    ""    ""      ""    ""      ""    "mbr"    ""      
#> 4 Andorra  ""    ""    ""    ""    ""      ""    ""      ""    ""       "mbr"   
#> 5 Angola   ""    ""    ""    ""    "mbr"   ""    ""      ""    ""       ""      
#> 6 Anguilla ""    ""    "mbr" ""    ""      ""    ""      ""    ""       ""      
#> 7 Antigua~ ""    ""    "mbr" ""    ""      ""    ""      ""    ""       ""      
#> 8 Argenti~ "mbr" "non" "non" "non" "non"   ""    "mbr"   "non" "non"    "non"   
#> 9 Armenia  ""    ""    ""    "obs" ""      ""    ""      ""    ""       "mbr"   
#> 10 Aruba K~ "els" ""    "mbr" ""    ""      ""    ""      ""    ""       ""      
#> # ... with 199 more rows
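
The string manipulation described above can be tried offline on a small stand-in. This is only a sketch: the js string below is a hypothetical two-country sample, not the real file, and it assumes the array is valid JSON that ends at the first "];". Under that assumption the whole array can be handed to jsonlite::fromJSON in one go instead of being split into one snippet per country.

```r
library(jsonlite)

# Hypothetical stand-in for the downloaded script; the layout of the
# real file is an assumption here.
js <- paste0(
  'var lang = "en";\n',
  'var countries = [',
  '{"name":"Albania","code":"AL","groups":{"FATF":"","MONEYVAL":"mbr"}},',
  '{"name":"Angola","code":"AO","groups":{"FATF":"","MONEYVAL":""}}',
  '];\n',
  'var other = 1;'
)

# Keep everything after the assignment, then cut at the first "];"
after <- strsplit(js, "var countries = ", fixed = TRUE)[[1]][2]
array_txt <- paste0(strsplit(after, "];", fixed = TRUE)[[1]][1], "]")

# flatten = TRUE spreads the nested "groups" object into groups.* columns
countries <- fromJSON(array_txt, flatten = TRUE)
names(countries) <- sub("^groups\\.", "", names(countries))

countries
```

If the real script contains anything that is not strict JSON (unquoted keys, trailing commas), fromJSON will fail and the snippet-by-snippet approach or a JavaScript engine is the safer route.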

CodePudding user response:

This is mainly a test of how JavaScript evaluation works with V8, the embedded JavaScript and WebAssembly engine for R.
https://cran.r-project.org/web/packages/V8/vignettes/v8_intro.html

Create a context, evaluate the downloaded JavaScript in it, and read the value of the countries variable back from V8. It comes back as a nested data frame, hence the unnest(); the last row is filled with NAs, hence the filter().

library(httr)
library(V8)
library(dplyr)
library(tidyr)
url <- paste0('https://www.fatf-gafi.org/media/fatf/fatfv20/',
              'js/country-data-multi-lang.js')
js_content <- content(GET(url), 'text')

ct <- v8()
ct$eval(js_content)
ct$get("countries") %>% 
  unnest(cols = c(groups)) %>%
  select(c(1:2,4:14,16)) %>%
  filter(!is.na(name))

#> # A tibble: 209 × 14
#>    name       code  FATF  APG   CFATF EAG   ESAAMLG GABAC GAFILAT GIABA MENAFATF
#>    <chr>      <chr> <chr> <chr> <chr> <chr> <chr>   <chr> <chr>   <chr> <chr>   
#>  1 Afghanist… AF    ""    "mbr" ""    "obs" ""      ""    ""      ""    ""      
#>  2 Albania    AL    ""    ""    ""    ""    ""      ""    ""      ""    ""      
#>  3 Algeria    DZ    ""    ""    ""    ""    ""      ""    ""      ""    "mbr"   
#>  4 Andorra    AD    ""    ""    ""    ""    ""      ""    ""      ""    ""      
#>  5 Angola     AO    ""    ""    ""    ""    "mbr"   ""    ""      ""    ""      
#>  6 Anguilla   AI    ""    ""    "mbr" ""    ""      ""    ""      ""    ""      
#>  7 Antigua a… AG    ""    ""    "mbr" ""    ""      ""    ""      ""    ""      
#>  8 Argentina  AR    "mbr" "non" "non" "non" "non"   ""    "mbr"   "non" "non"   
#>  9 Armenia    AM    ""    ""    ""    "obs" ""      ""    ""      ""    ""      
#> 10 Aruba Kin… AW    "els" ""    "mbr" ""    ""      ""    ""      ""    ""      
#> # … with 200 more rows, and 3 more variables: MONEYVAL <chr>,
#> #   jurisdiction <chr>, id <chr>
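
To see why the result comes back nested, the same steps can be run on a tiny script without touching the network. This is a sketch under assumptions: the two-country array is a hypothetical stand-in for the real file, and jsonlite::flatten is used here in place of unnest() just to keep the dependencies small.

```r
library(V8)

ct <- v8()

# Hypothetical two-entry stand-in for the site's script
ct$eval('var countries = [
  {"name": "Albania", "code": "AL", "groups": {"FATF": "", "MONEYVAL": "mbr"}},
  {"name": "Angola",  "code": "AO", "groups": {"FATF": "", "MONEYVAL": ""}}
];')

# V8 hands the array back as a data frame whose "groups" column
# is itself a data frame (one row per country)
nested <- ct$get("countries")
is.data.frame(nested$groups)

# Spread the nested column into groups.FATF, groups.MONEYVAL, ...
flat <- jsonlite::flatten(nested)
names(flat)
```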