I have a pretty good understanding of R but am new to JSON file types and best practices for parsing. I'm having difficulties building a data frame from a raw JSON file. The JSON file (data below) is made up of repeated measure data that has multiple observations per user.
When the raw file is read into r
jdata<-read_json("./raw.json")
It comes in as a "List of 1" with that list being user_ids. Within each user_id are further lists, like so -
jdata$user_id$`sjohnson`$date$`2020-09-25`$city
The very last position actually splits into two options - $city or $zip. At the highest level, there are about 89 users in the complete file.
My goal would be to end up with a rectangular data frame or multiple data frames that I can merge together like this - where I don't actually need the zip code.
I've tried jsonlite along with tidyverse and the farthest I seem to get is a data frame with one variable at the smallest level - cities and zip codes alternating rows using this
df <- as.data.frame(matrix(unlist(jdata), nrow=length(unlist(jdata["users"]))))
Any help/suggestions to get closer to the table above would be much appreciated. I have a feeling I'm failing at looping it back through the different levels.
Here is an example of the raw json file structure:
{
"user_id": {
"sjohnson": {
"date": {
"2020-09-25": {
"city": "Denver",
"zip": "80014"
},
"2020-10-01": {
"city": "Atlanta",
"zip": "30301"
},
"2020-11-04": {
"city": "Jacksonville",
"zip": "14001"
}
},
"asmith: {
"date": {
"2020-10-16": {
"city": "Cleavland",
"zip": "34321"
},
"2020-11-10": {
"City": "Elmhurst",
"zip": "00013
},
"2020-11-10 08:49:36": {
"location": null,
"timestamp": 1605016176013
}
}
CodePudding user response:
We can build our desired structure step by step:
library(jsonlite)
library(tidyverse)
df <- fromJSON('{
"user_id": {
"sjohnson": {
"date": {
"2020-09-25": {
"city": "Denver",
"zip": "80014"
},
"2020-10-01": {
"city": "Atlanta",
"zip": "30301"
},
"2020-11-04": {
"city": "Jacksonville",
"zip": "14001"
}
}
},
"asmith": {
"date": {
"2020-10-16": {
"city": "Cleavland",
"zip": "34321"
},
"2020-11-10": {
"city": "Elmhurst",
"zip": "00013"
},
"2020-11-10 08:49:36": {
"location": null,
"timestamp": 1605016176013
}
}
}
}
}')
df %>%
bind_rows() %>%
pivot_longer(everything(), names_to = 'user_id') %>%
unnest_longer(value, indices_to = 'date') %>%
unnest_longer(value, indices_to = 'var') %>%
mutate(city = unlist(value)) %>%
filter(var == 'city') %>%
select(-var, -value)
which gives:
# A tibble: 5 x 3
user_id date city
<chr> <chr> <chr>
1 sjohnson 2020-09-25 Denver
2 sjohnson 2020-10-01 Atlanta
3 sjohnson 2020-11-04 Jacksonville
4 asmith 2020-10-16 Cleavland
5 asmith 2020-11-10 Elmhurst
CodePudding user response:
Here's a solution in the tidyverse
: a custom function unnestable()
designed to recursively unnest into a table the contents of a list
like you describe. See Details for particulars regarding the format of such a list and its table.
Solution
First ensure the necessary libraries are present:
library(jsonlite)
library(tidyverse)
Then define the unnestable()
function as follows:
unnestable <- function(v) {
# If we've reached the bottommost list, simply treat it as a table...
if(all(sapply(
X = v,
# Check that each element is a single value (or NULL).
FUN = function(x) {
is.null(x) || purrr::is_scalar_atomic(x)
},
simplify = TRUE
))) {
v %>%
# Replace any NULLs with NAs to preserve blank fields...
sapply(
FUN = function(x) {
if(is.null(x))
NA
else
x
},
simplify = FALSE
) %>%
# ...and convert this bottommost list into a table.
tidyr::as_tibble()
}
# ...but if this list contains another nested list, then recursively unnest its
# contents and combine their tabular results.
else if(purrr::is_scalar_list(v)) {
# Take the contents within the nested list...
v[[1]] %>%
# ...apply this 'unnestable()' function to them recursively...
sapply(
FUN = unnestable,
simplify = FALSE,
USE.NAMES = TRUE
) %>%
# ...and stack their results.
dplyr::bind_rows(.id = names(v)[1])
}
# Otherwise, the format is unrecognized and yields no results.
else {
NULL
}
}
Finally, process the JSON data as follows:
# Read the JSON file into an R list.
jdata <- jsonlite::read_json("./raw.json")
# Flatten the R list into a table, via 'unnestable()'
flat_data <- unnestable(jdata)
# View the raw table.
flat_data
Naturally, you can reformat this table however you desire:
library(lubridate)
flat_data <- flat_data %>%
dplyr::transmute(
user_id = as.character(user_id),
date = lubridate::as_datetime(date),
city = as.character(city)
) %>%
dplyr::distinct()
# View the reformatted table.
flat_data
Results
Given a raw.json
file like that sampled here
{
"user_id": {
"sjohnson": {
"date": {
"2020-09-25": {
"city": "Denver",
"zip": "80014"
},
"2020-10-01": {
"city": "Atlanta",
"zip": "30301"
},
"2020-11-04": {
"city": "Jacksonville",
"zip": "14001"
}
}
},
"asmith": {
"date": {
"2020-10-16": {
"city": "Cleavland",
"zip": "34321"
},
"2020-11-10": {
"city": "Elmhurst",
"zip": "00013"
},
"2020-11-10 08:49:36": {
"location": null,
"timestamp": 1605016176013
}
}
}
}
}
then unnestable()
will yield a tibble
like this
# A tibble: 6 x 6
user_id date city zip location timestamp
<chr> <chr> <chr> <chr> <lgl> <dbl>
1 sjohnson 2020-09-25 Denver 80014 NA NA
2 sjohnson 2020-10-01 Atlanta 30301 NA NA
3 sjohnson 2020-11-04 Jacksonville 14001 NA NA
4 asmith 2020-10-16 Cleavland 34321 NA NA
5 asmith 2020-11-10 Elmhurst 00013 NA NA
6 asmith 2020-11-10 08:49:36 NA NA NA 1605016176013
which dplyr
will format into the result below:
# A tibble: 6 x 3
user_id date city
<chr> <dttm> <chr>
1 sjohnson 2020-09-25 00:00:00 Denver
2 sjohnson 2020-10-01 00:00:00 Atlanta
3 sjohnson 2020-11-04 00:00:00 Jacksonville
4 asmith 2020-10-16 00:00:00 Cleavland
5 asmith 2020-11-10 00:00:00 Elmhurst
6 asmith 2020-11-10 08:49:36 NA
Details
List Format
To be precise, the list
represents nested groupings by the fields {group_1
, group_2
, ..., group_n
}, and it must be of the form:
list(
group_1 = list(
"value_1" = list(
group_2 = list(
"value_1.1" = list(
# .
# .
# .
group_n = list(
"value_1.1.….n.1" = list(
field_a = 1,
field_b = TRUE
),
"value_1.1.….n.2" = list(
field_a = 2,
field_c = "2"
)
# ...
)
),
"value_1.2" = list(
# .
# .
# .
)
# ...
)
),
"value_2" = list(
group_2 = list(
"value_2.1" = list(
# .
# .
# .
group_n = list(
"value_2.1.….n.1" = list(
field_a = 3,
field_d = 3.0
)
# ...
)
),
"value_2.2" = list(
# .
# .
# .
)
# ...
)
)
# ...
)
)
Table Format
Given a list
of this form, unnestable()
will flatten it into a table of the following form:
# A tibble: … x …
group_1 group_2 ... group_n field_a field_b field_c field_d
<chr> <chr> ... <chr> <dbl> <lgl> <chr> <dbl>
1 value_1 value_1.1 ... value_1.1.….n.1 1 TRUE NA NA
2 value_1 value_1.1 ... value_1.1.….n.2 2 NA 2 NA
3 value_1 value_1.2 ... value_1.2.….n.1 ... ... ... ...
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
j value_2 value_2.1 ... value_2.1.….n.1 3 NA NA 3
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
k value_2 value_2.2 ... value_2.2.….n.1 ... ... ... ...
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮