I have a file ("my_file") in R that looks something like this:
NAME Address_Parse
1 name1 [('372', 'StreetNumber'), ('river', 'StreetName'), ('St', 'StreetType'), ('S', 'StreetDirection'), ('toronto', 'Municipality'), ('ON', 'Province'), ('A1C', 'PostalCode'), ('9R7', 'PostalCode')]
2 name2 [('208', 'StreetNumber'), ('ocean', 'StreetName'), ('St', 'StreetType'), ('E', 'StreetDirection'), ('Toronto', 'Municipality'), ('ON', 'Province'), ('J8N', 'PostalCode'), ('1G8', 'PostalCode')]
In case the structure is confusing, here is how the file looks like
my_file = structure(list(NAME = c("name1", "name2"), Address_Parse = c("[('372', 'StreetNumber'), ('river', 'StreetName'), ('St', 'StreetType'), ('S', 'StreetDirection'), ('toronto', 'Municipality'), ('ON', 'Province'), ('A1C', 'PostalCode'), ('9R7', 'PostalCode')]",
"[('208', 'StreetNumber'), ('ocean', 'StreetName'), ('St', 'StreetType'), ('E', 'StreetDirection'), ('Toronto', 'Municipality'), ('ON', 'Province'), ('J8N', 'PostalCode'), ('1G8', 'PostalCode')]"
)), class = "data.frame", row.names = c(NA, -2L))
Objective: For each row, I would like to take each of the "elements" (e.g. "StreetNumber", "StreetName", "StreetType", etc.) and convert it into a new column. This would look something like this:
name StreetNumber StreetName StreetType StreetDirection Municipality Province PostalCode
1 name1 372 river St S toronto ON A1C9R7
2 name2 208 ocean St E Toronto ON J8N1G8
To me, it appears that the address field is in JSON format (I could be wrong about this). I tried to look at different ways I could parse the JSON. For example, I tried to apply the answer provided here (R: convert nested JSON in a data frame column to addtional columns in the same data frame):
library(dplyr)
library(tidyr)
library(purrr)
library(jsonlite)
final = my_file %>%
mutate(
json_parsed = map(Address_Parse, ~ fromJSON(., flatten=TRUE))
) %>%
unnest(json_parsed)
However, this is giving me the following error:
Error in `mutate()`:
! Problem while computing `json_parsed = map(Address_Parse, ~fromJSON(., flatten = TRUE))`.
Caused by error:
! lexical error: invalid char in json text.
[('372', 'StreetNumber'), ('rive
(right here) ------^
Run `rlang::last_error()` to see where the error occurred.
I then tried another approach:
final <- my_file %>%
rowwise() %>%
do(data.frame(fromJSON(.$Address_Parse , flatten = T))) %>%
ungroup() %>%
bind_cols(my_file %>% select(-Address_Parse ))
But I now get a new error:
Error: lexical error: invalid char in json text.
[('372', 'StreetNumber'), ('rive
(right here) ------^
Can someone please show me to resolve this?
Thank you!
CodePudding user response:
You might have to rejig the formatting of the JSON slightly to get it to work.
I've used the stream_in
function instead of fromJSON
as it is usually quicker and takes care of a lot of stuff somewhat automatically.
library(jsonlite)
out <- stream_in(textConnection(chartr("()'", '[]"', my_file$Address_Parse)))
s <- seq(1, ncol(out)/2)
setNames(out[s], unlist(out[1, -s]))
# StreetNumber StreetName StreetType StreetDirection Municipality Province PostalCode PostalCode
#1 372 river St S toronto ON A1C 9R7
#2 208 ocean St E Toronto ON J8N 1G8
CodePudding user response:
We may need some modifications in the text before using fromJSON
- i.e. keep the format as "key":value
instead of (value, 'key')
and also insert the {
, }
after, before the [
, ]
library(dplyr)
library(purrr)
library(stringr)
library(jsonlite)
library(tidyr)
my_file %>%
mutate(Address_Parse = str_replace_all(Address_Parse,
"\\(([^,] ),\\s*([^)] )\\)", "\\2:\\1") %>%
str_replace(fixed("["), "[{") %>%
str_replace(fixed("]"), "}]") %>%
str_replace_all(fixed("'"), '"') %>%
map(fromJSON)) %>%
unnest(Address_Parse) %>%
type.convert(as.is = TRUE)
-output
A tibble: 2 × 8
NAME StreetNumber StreetName StreetType StreetDirection Municipality Province PostalCode
<chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
1 name1 372 river St S toronto ON A1C
2 name2 208 ocean St E Toronto ON J8N
Or using reticulate
as it seems to be tuple
library(reticulate)
py_run_string(paste0("tmp=", paste(my_file$Address_Parse,
collapse = ",")))
out <- cbind(my_file[1], do.call(rbind, lapply(py$tmp, \(x)
do.call(cbind, lapply(x, \(y) setNames(data.frame(y[[1]]),
y[[2]]))))))
-output
> out
NAME StreetNumber StreetName StreetType StreetDirection Municipality Province PostalCode PostalCode
1 name1 372 river St S toronto ON A1C 9R7
2 name2 208 ocean St E Toronto ON J8N 1G8