Scraping data from a submitted form from SIPRI

Time:08-18

I am trying to get data from a website (https://armstrade.sipri.org/armstrade/page/values.php) that requires submitting a form. The form has radio buttons and drop-down boxes for selecting a time period (years), countries, and a download method. I am aware that the data can be downloaded manually, but I would like to programmatically download the import data for all countries between 1990 and 2000.

I have tried two different approaches based on answers on SO (see below for code), but neither actually produces results. Ideally, I would like a data frame similar to the one in the downloaded Excel file. Any help or guidance would be greatly appreciated.

Thank you in advance.

Approach 1

The first approach is based on Python code for the same site: Scrape a php webpage that needs a submitted form

library(httr)
library(rvest)
df = httr::POST("https://armstrade.sipri.org/armstrade/html/export_values.php", 
             encode = "form",
             body = list('import_or_export' = 'export',
                         'country_code'= 'All',
                         'from' = 1990,
                         'to' = 2000,
                         'summarize' = 'country',
                         'filetype'= 'excel',
                         'Action' ='Download'),
             verbose())

Approach 2

The second approach I've tried is similar to the one in: How to retrieve response by using POST in R

headers = c('Content-Type' = 'application/json; charset=UTF-8')
data = "{'country_code':'All','low_year':'1990','high_year':'2000','import_or_export':'import','summarize':'country','filetype':'html','Action':'Download'}"
r <- httr::POST(url = "https://armstrade.sipri.org/armstrade/html/export_values.php", 
                httr::add_headers(.headers=headers), body = data)

CodePudding user response:

I leave the parsing and cleaning to you, but here's a suggestion for the request:

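library(tidyverse)
library(httr2)
library(rvest)

"https://armstrade.sipri.org/armstrade/html/export_values.php" %>% 
  request() %>%  
  # send the fields as a URL-encoded form body (this makes the request a POST)
  req_body_form(
    'import_or_export' = 'export',
    'country_code'= '',        # left empty here (the question used 'All')
    'low_year' = 1990,         # year range: low_year/high_year, not from/to
    'high_year' = 2000,
    'summarize' = 'country',
    'filetype'= 'html',        # ask for an HTML page so rvest can parse it
    'Action' = 'Download'
  ) %>%  
  req_perform() %>% 
  resp_body_html() %>% 
  html_table() %>%             # list with one tibble per <table> on the page
  getElement(2) %>%            # the second table holds the values
  slice(11:nrow(.))            # drop the leading rows before the year header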

# A tibble: 89 x 14
   X1        X2    X3    X4    X5    X6    X7    X8    X9    X10   X11   X12   X13   X14  
   <chr>     <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 &nbsp     1990  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  Total NA   
 2 Angola    &nbsp &nbsp 8     &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp 8     NA   
 3 Argentina 6     0     &nbsp 13    5     5     &nbsp &nbsp &nbsp &nbsp 2     31    NA   
 4 Aruba     &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp 18    &nbsp &nbsp &nbsp &nbsp 18    NA   
 5 Australia 168   90    &nbsp 30    36    36    16    20    4     &nbsp &nbsp 400   NA   
 6 Austria   30    20    20    10    17    &nbsp 18    1     29    23    24    191   NA   
 7 Belarus   &nbsp &nbsp &nbsp 8     &nbsp 7     129   398   63    452   293   1349  NA   
 8 Belgium   1     1     &nbsp &nbsp 33    158   57    93    46    45    26    458   NA   
 9 Brazil    106   127   98    40    54    38    27    27    18    &nbsp &nbsp 535   NA   
10 Bulgaria  6     42    16    28    55    1     21    6     39    167   2     381   NA   
# ... with 79 more rows
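
For the cleaning step, here is a minimal sketch in base R. It assumes the table looks like the output above: the first row holds the year labels, empty cells come through as the literal string "&nbsp", and the last column is entirely NA. The helper name clean_sipri_table is hypothetical, and the sample rows are the Angola and Argentina rows copied from the output above, so that the example runs without a network request.

```r
# Hypothetical helper: tidy the raw table returned by html_table().
# Assumptions: row 1 holds the column labels, empty cells are "&nbsp".
clean_sipri_table <- function(raw) {
  raw <- as.data.frame(raw, stringsAsFactors = FALSE)

  # Drop columns that contain only NA (the trailing X14 column above)
  raw <- raw[, colSums(!is.na(raw)) > 0, drop = FALSE]

  # Promote the first row to column names; its first cell is a placeholder
  header <- as.character(unlist(raw[1, ]))
  header[1] <- "country"
  out <- raw[-1, , drop = FALSE]
  names(out) <- header
  rownames(out) <- NULL

  # Turn "&nbsp" placeholders into NA and convert value columns to numeric
  to_number <- function(x) {
    x[x %in% "&nbsp"] <- NA
    as.numeric(x)
  }
  out[-1] <- lapply(out[-1], to_number)
  out
}

# Sample input: header row plus two rows taken from the output above
rows <- list(
  c("&nbsp", 1990:2000, "Total", NA),
  c("Angola", "&nbsp", "&nbsp", "8", rep("&nbsp", 8), "8", NA),
  c("Argentina", "6", "0", "&nbsp", "13", "5", "5", rep("&nbsp", 4), "2", "31", NA)
)
raw <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)

cleaned <- clean_sipri_table(raw)
print(cleaned)
```

In practice, the result of the pipeline above could be passed straight to the helper instead of the sample data. Whether an empty cell means "no transfers" or "no data" depends on how SIPRI defines it, so the sketch keeps those cells as NA rather than converting them to 0.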