Home > Back-end >  Parsing JSON in R: lexical error - invalid char in json text
Parsing JSON in R: lexical error - invalid char in json text

Time:11-21

I have a file ("my_file") in R that looks something like this:

  NAME                                                                                                                                                                                     Address_Parse
1 name1 [('372', 'StreetNumber'), ('river', 'StreetName'), ('St', 'StreetType'), ('S', 'StreetDirection'), ('toronto', 'Municipality'), ('ON', 'Province'), ('A1C', 'PostalCode'), ('9R7', 'PostalCode')]
2 name2 [('208', 'StreetNumber'), ('ocean', 'StreetName'), ('St', 'StreetType'), ('E', 'StreetDirection'), ('Toronto', 'Municipality'), ('ON', 'Province'), ('J8N', 'PostalCode'), ('1G8', 'PostalCode')]

In case the structure is confusing, here is how the file looks like

my_file = structure(list(NAME = c("name1", "name2"), Address_Parse = c("[('372', 'StreetNumber'), ('river', 'StreetName'), ('St', 'StreetType'), ('S', 'StreetDirection'), ('toronto', 'Municipality'), ('ON', 'Province'), ('A1C', 'PostalCode'), ('9R7', 'PostalCode')]", 
"[('208', 'StreetNumber'), ('ocean', 'StreetName'), ('St', 'StreetType'), ('E', 'StreetDirection'), ('Toronto', 'Municipality'), ('ON', 'Province'), ('J8N', 'PostalCode'), ('1G8', 'PostalCode')]"
)), class = "data.frame", row.names = c(NA, -2L))

Objective: For each row, I would like to take each of the "elements" (e.g. "StreetNumber", "StreetName", "StreetType", etc.) and convert it into a new column. This would look something like this:

   name StreetNumber StreetName StreetType StreetDirection Municipality Province PostalCode
1 name1          372      river         St               S      toronto       ON     A1C9R7
2 name2          208      ocean         St               E      Toronto       ON     J8N1G8

To me, it appears that the address field is in JSON format (I could be wrong about this). I tried to look at different ways I could parse the JSON. For example, I tried to apply the answer provided here (R: convert nested JSON in a data frame column to addtional columns in the same data frame):

library(dplyr)
library(tidyr)
library(purrr)
library(jsonlite)

final = my_file %>%
  mutate(
    json_parsed = map(Address_Parse, ~ fromJSON(., flatten=TRUE))
  ) %>%
  unnest(json_parsed)

However, this is giving me the following error:

Error in `mutate()`:
! Problem while computing `json_parsed = map(Address_Parse, ~fromJSON(., flatten = TRUE))`.
Caused by error:
! lexical error: invalid char in json text.
                                      [('372', 'StreetNumber'), ('rive
                     (right here) ------^
Run `rlang::last_error()` to see where the error occurred.

I then tried another approach:

final <- my_file %>% 
          rowwise() %>%
          do(data.frame(fromJSON(.$Address_Parse , flatten = T))) %>%
          ungroup() %>%
          bind_cols(my_file  %>% select(-Address_Parse ))

But I now get a new error:

Error: lexical error: invalid char in json text.
                                      [('372', 'StreetNumber'), ('rive
                     (right here) ------^

Can someone please show me to resolve this?

Thank you!

CodePudding user response:

You might have to rejig the formatting of the JSON slightly to get it to work.
I've used the stream_in function instead of fromJSON as it is usually quicker and takes care of a lot of stuff somewhat automatically.

library(jsonlite)
out <- stream_in(textConnection(chartr("()'", '[]"', my_file$Address_Parse)))
s <- seq(1, ncol(out)/2)
setNames(out[s], unlist(out[1, -s]))

#  StreetNumber StreetName StreetType StreetDirection Municipality Province PostalCode PostalCode
#1          372      river         St               S      toronto       ON        A1C        9R7
#2          208      ocean         St               E      Toronto       ON        J8N        1G8

CodePudding user response:

We may need some modifications in the text before using fromJSON - i.e. keep the format as "key":value instead of (value, 'key') and also insert the {, } after, before the [, ]

library(dplyr)
library(purrr)
library(stringr)
library(jsonlite)
library(tidyr)
my_file  %>% 
  mutate(Address_Parse = str_replace_all(Address_Parse,
      "\\(([^,] ),\\s*([^)] )\\)", "\\2:\\1") %>% 
   str_replace(fixed("["), "[{") %>%
   str_replace(fixed("]"), "}]") %>%
   str_replace_all(fixed("'"), '"') %>% 
   map(fromJSON)) %>%
   unnest(Address_Parse) %>%
 type.convert(as.is = TRUE)

-output

 A tibble: 2 × 8
  NAME  StreetNumber StreetName StreetType StreetDirection Municipality Province PostalCode
  <chr>        <int> <chr>      <chr>      <chr>           <chr>        <chr>    <chr>     
1 name1          372 river      St         S               toronto      ON       A1C       
2 name2          208 ocean      St         E               Toronto      ON       J8N       

Or using reticulate as it seems to be tuple

library(reticulate)
py_run_string(paste0("tmp=", paste(my_file$Address_Parse, 
            collapse = ",")))
out <- cbind(my_file[1], do.call(rbind, lapply(py$tmp, \(x) 
  do.call(cbind, lapply(x, \(y) setNames(data.frame(y[[1]]), 
    y[[2]]))))))

-output

> out
   NAME StreetNumber StreetName StreetType StreetDirection Municipality Province PostalCode PostalCode
1 name1          372      river         St               S      toronto       ON        A1C        9R7
2 name2          208      ocean         St               E      Toronto       ON        J8N        1G8
  • Related