R: Webscraping JSON files?


I am working with the R programming language.

I have a website (e.g. https://mywebsite.com) whose page source (i.e. Ctrl+U) looks something like this:

[
  {
    "title1" : "abc 123",
    "title2" : "bca 321",
    "title3" : "cba 213",
    "title4" : {"title5": "title6", "title7": [-17662.3456, 987621.7654]}
  },

  ...

  {
    "title1" : "aec 183",
    "title2" : "bga 351",
    "title3" : "cha 293",
    "title4" : {"title5": "title6", "title7": [-1621626123.23234, 652238322.122]}
  }
]

I would like to extract the information in "title7" for each block (say there are "n" blocks in total). The final result would look something like this:

# final result

  id      title7_1    title7_2
   1 -1.766235e+04    987621.8
......
   n -1.621626e+09 652238322.1

I was able to bring this website into R.

library(rvest)
library(httr)
library(XML)

url <- " https://mywebsite.com"
page <- read_html(url)

This looks something like this:

> page

{html_document}
<html>

[1] <body><p>[\r\n{\r\n "title1" : "abc 123", \r\n "title2" : "bca 321", \r\n .....

I tried to further inspect this:

> str(page)

List of 2
 $ node:<externalptr>
 $ doc :<externalptr>
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

But from here, I don't know what to do. I thought that maybe I could extract this information like this, but it produces an empty output:

library(dplyr)
page %>% html_nodes("title4")

{xml_nodeset (0)}

I have a feeling that I am not doing this correctly, seeing as this website does not appear to be in typical HTML format and might actually be JSON. Can someone please show me how to fix this problem?

Thanks!

Note: It looks like the website was successfully loaded into R and the "page" object was successfully created. Perhaps some function like "grep/gsub" could be used to extract these numbers from the "page" object? Seeing that the information from the website has already been imported into R, can something be done that takes advantage of this fact?
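For instance, I was imagining something rough like this (an untested sketch that assumes the "title7" arrays always look exactly like the example above):

# pull the "title7" arrays straight out of the raw page text
txt <- html_text(page)
matches <- regmatches(txt, gregexpr('"title7"\\s*:\\s*\\[[^]]*\\]', txt, perl = TRUE))[[1]]
# drop the key and brackets, then split each match on commas
nums <- lapply(matches, function(m) as.numeric(strsplit(gsub('.*\\[|\\]', '', m), ',')[[1]]))

But this feels brittle, so I suspect there is a better way.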

CodePudding user response:

The page might require JavaScript; if you cannot access the data with rvest even though your CSS selector/XPath is correct, that is a likely culprit. You can go to Chrome -> Inspect -> Network -> XHR and search for the JSON or API endpoint that stores your data.
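If you do find such an endpoint, you can usually skip rvest entirely and request the JSON directly. A sketch, where the URL is just a placeholder for whatever endpoint shows up in the Network tab:

library(httr)
library(jsonlite)

# hypothetical endpoint found under Network -> XHR
resp <- GET("https://mywebsite.com/api/data.json")
dat <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(dat)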

CodePudding user response:

You're close. I set up a similar page on an internal website.

library(rvest)
res <- read_html("https://...../quux.json") # not a real domain here
res
# {html_document}
# <html>
# [1] <body><p>[\n{\n"title1" : "abc 123",\n"title2" : "bca 321",\n"title3" : "cba 213 ...

We just need to extract the page text and parse it as JSON:

quux <- jsonlite::parse_json(html_text(res), simplifyDataFrame = TRUE)
str(quux)
# 'data.frame': 2 obs. of  4 variables:
#  $ title1: chr  "abc 123" "aec 183"
#  $ title2: chr  "bca 321" "bga 351"
#  $ title3: chr  "cba 213" "cha 293"
#  $ title4:'data.frame':   2 obs. of  2 variables:
#   ..$ title5: chr  "title6" "title6"
#   ..$ title7:List of 2
#   .. ..$ : num  -17662 987622
#   .. ..$ : num  -1.62e+09 6.52e+08
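As an aside, jsonlite::fromJSON should give the same result here, since it simplifies to data frames by default (assuming the page really serves plain JSON):

# same result: fromJSON defaults to simplifyDataFrame = TRUE
quux <- jsonlite::fromJSON(html_text(res))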

From here, you can subset the list to get just the components you need, such as this matrix:

do.call(rbind, quux$title4$title7)
#               [,1]        [,2]
# [1,] -1.766235e+04    987621.8
# [2,] -1.621626e+09 652238322.1
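And if you want exactly the id/title7_1/title7_2 data frame from your question, wrap that matrix in a data.frame (building on quux above; the column names are just the ones you asked for):

mat <- do.call(rbind, quux$title4$title7)
data.frame(id = seq_len(nrow(mat)), title7_1 = mat[, 1], title7_2 = mat[, 2])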