Home > Net >  Parse multi-level json file in r
Parse multi-level json file in r

Time:10-28

I have a pretty good understanding of R but am new to JSON file types and best practices for parsing. I'm having difficulties building a data frame from a raw JSON file. The JSON file (data below) is made up of repeated measure data that has multiple observations per user.

When the raw file is read into r

 jdata<-read_json("./raw.json")

It comes in as a "List of 1" with that list being user_ids. Within each user_id are further lists, like so -

jdata$user_id$`sjohnson`$date$`2020-09-25`$city

The very last position actually splits into two options - $city or $zip. At the highest level, there are about 89 users in the complete file.

My goal would be to end up with a rectangular data frame or multiple data frames that I can merge together like this - where I don't actually need the zip code.

example table

I've tried jsonlite along with tidyverse and the farthest I seem to get is a data frame with one variable at the smallest level - cities and zip codes alternating rows using this

df  <-  as.data.frame(matrix(unlist(jdata), nrow=length(unlist(jdata["users"]))))

Any help/suggestions to get closer to the table above would be much appreciated. I have a feeling I'm failing at looping it back through the different levels.

Here is an example of the raw json file structure:

 {
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
              "city": "Denver",
              "zip": "80014"
            },
            "2020-10-01": {
              "city": "Atlanta",
              "zip": "30301"
            },
            "2020-11-04": {
              "city": "Jacksonville",
              "zip": "14001"
            }
         },
    "asmith: {
      "date": {
        "2020-10-16": {
              "city": "Cleavland",
              "zip": "34321"
        },
        "2020-11-10": {
              "City": "Elmhurst",
              "zip": "00013
            },
            "2020-11-10 08:49:36": {
              "location": null,
              "timestamp": 1605016176013
            }
          }
 

CodePudding user response:

We can build our desired structure step by step:

library(jsonlite)
library(tidyverse)

df <- fromJSON('{
   "user_id": {
    "sjohnson": {
       "date": {
        "2020-09-25": {
           "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
         },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
       }
    },
    "asmith": {
       "date": {
         "2020-10-16": {
           "city": "Cleavland",
           "zip": "34321"
         },
        "2020-11-10": {
           "city": "Elmhurst",
           "zip": "00013"
         },
         "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
       }
     }
   }
}')

df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(city = unlist(value)) %>%
  filter(var == 'city') %>%
  select(-var, -value)

which gives:

# A tibble: 5 x 3
  user_id  date       city        
  <chr>    <chr>      <chr>       
1 sjohnson 2020-09-25 Denver      
2 sjohnson 2020-10-01 Atlanta     
3 sjohnson 2020-11-04 Jacksonville
4 asmith   2020-10-16 Cleavland   
5 asmith   2020-11-10 Elmhurst 

CodePudding user response:

Here's a solution in the tidyverse: a custom function unnestable() designed to recursively unnest into a table the contents of a list like you describe. See Details for particulars regarding the format of such a list and its table.

Solution

First ensure the necessary libraries are present:

library(jsonlite)
library(tidyverse)

Then define the unnestable() function as follows:

unnestable <- function(v) {
  # If we've reached the bottommost list, simply treat it as a table...
  if(all(sapply(
    X = v,
    # Check that each element is a single value (or NULL).
    FUN = function(x) {
      is.null(x) || purrr::is_scalar_atomic(x)
    },
    simplify = TRUE
  ))) {
    v %>%
      # Replace any NULLs with NAs to preserve blank fields...
      sapply(
        FUN = function(x) {
          if(is.null(x))
            NA
          else
            x
        },
        simplify = FALSE
      ) %>%
      # ...and convert this bottommost list into a table.
      tidyr::as_tibble()
  }
  # ...but if this list contains another nested list, then recursively unnest its
  # contents and combine their tabular results.
  else if(purrr::is_scalar_list(v)) {
    # Take the contents within the nested list...
    v[[1]] %>%
      # ...apply this 'unnestable()' function to them recursively...
      sapply(
        FUN = unnestable,
        simplify = FALSE,
        USE.NAMES = TRUE
      ) %>%
      # ...and stack their results.
      dplyr::bind_rows(.id = names(v)[1])
  }
  # Otherwise, the format is unrecognized and yields no results.
  else {
    NULL
  }
}

Finally, process the JSON data as follows:

# Read the JSON file into an R list.
jdata <- jsonlite::read_json("./raw.json")


# Flatten the R list into a table, via 'unnestable()'
flat_data <- unnestable(jdata)


# View the raw table.
flat_data

Naturally, you can reformat this table however you desire:

library(lubridate)

flat_data <- flat_data %>%
  dplyr::transmute(
    user_id = as.character(user_id),
    date = lubridate::as_datetime(date),
    city = as.character(city)
  ) %>%
  dplyr::distinct()


# View the reformatted table.
flat_data

Results

Given a raw.json file like that sampled here

{
  "user_id": {
    "sjohnson": {
      "date": {
        "2020-09-25": {
          "city": "Denver",
          "zip": "80014"
        },
        "2020-10-01": {
          "city": "Atlanta",
          "zip": "30301"
        },
        "2020-11-04": {
          "city": "Jacksonville",
          "zip": "14001"
        }
      }
    },
    "asmith": {
      "date": {
        "2020-10-16": {
          "city": "Cleavland",
          "zip": "34321"
        },
        "2020-11-10": {
          "city": "Elmhurst",
          "zip": "00013"
        },
        "2020-11-10 08:49:36": {
          "location": null,
          "timestamp": 1605016176013
        }
      }
    }
  }
}

then unnestable() will yield a tibble like this

# A tibble: 6 x 6
  user_id  date                city         zip   location     timestamp
  <chr>    <chr>               <chr>        <chr> <lgl>            <dbl>
1 sjohnson 2020-09-25          Denver       80014 NA                  NA
2 sjohnson 2020-10-01          Atlanta      30301 NA                  NA
3 sjohnson 2020-11-04          Jacksonville 14001 NA                  NA
4 asmith   2020-10-16          Cleavland    34321 NA                  NA
5 asmith   2020-11-10          Elmhurst     00013 NA                  NA
6 asmith   2020-11-10 08:49:36 NA           NA    NA       1605016176013

which dplyr will format into the result below:

# A tibble: 6 x 3
  user_id  date                city        
  <chr>    <dttm>              <chr>       
1 sjohnson 2020-09-25 00:00:00 Denver      
2 sjohnson 2020-10-01 00:00:00 Atlanta     
3 sjohnson 2020-11-04 00:00:00 Jacksonville
4 asmith   2020-10-16 00:00:00 Cleavland   
5 asmith   2020-11-10 00:00:00 Elmhurst    
6 asmith   2020-11-10 08:49:36 NA          

Details

List Format

To be precise, the list represents nested groupings by the fields {group_1, group_2, ..., group_n}, and it must be of the form:

list(
  group_1 = list(
    "value_1" = list(
      group_2 = list(
        "value_1.1" = list(
          # .
          #  .
          #   .
               group_n = list(
                 "value_1.1.….n.1" = list(
                   field_a =    1,
                   field_b = TRUE
                 ),
                 "value_1.1.….n.2" = list(
                   field_a =   2,
                   field_c = "2"
                 )
                 # ...
               )
        ),
        "value_1.2" = list(
          # .
          #  .
          #   .
        )
        # ...
      )
    ),
    "value_2" = list(
      group_2 = list(
        "value_2.1" = list(
          # .
          #  .
          #   .
               group_n = list(
                 "value_2.1.….n.1" = list(
                   field_a =   3,
                   field_d = 3.0
                 )
                 # ...
               )
        ),
        "value_2.2" = list(
          # .
          #  .
          #   .
        )
        # ...
      )
    )
    # ...
  )
)

Table Format

Given a list of this form, unnestable() will flatten it into a table of the following form:

# A tibble: … x …
  group_1 group_2   ... group_n         field_a field_b field_c field_d
  <chr>   <chr>     ... <chr>             <dbl> <lgl>   <chr>     <dbl>
1 value_1 value_1.1 ... value_1.1.….n.1       1 TRUE    NA           NA
2 value_1 value_1.1 ... value_1.1.….n.2       2 NA      2            NA
3 value_1 value_1.2 ... value_1.2.….n.1     ... ...     ...         ...
⋮    ⋮         ⋮                 ⋮              ⋮  ⋮       ⋮             ⋮
j value_2 value_2.1 ... value_2.1.….n.1       3 NA      NA            3
⋮    ⋮         ⋮                 ⋮              ⋮  ⋮       ⋮             ⋮
k value_2 value_2.2 ... value_2.2.….n.1     ... ...     ...         ...
⋮    ⋮         ⋮                 ⋮              ⋮  ⋮       ⋮             ⋮
  • Related