R jsonlite does not parse JSON as per keys correctly-CodePudding

When converting json to columns in R using jsonlite, I came across a bug where the column values were in the wrong columns. I tried replicating it -

R> dt <- data.table(id = 1:2, 
json = c('{"user": "xyz2", "weightmap": "w1", "domains": "d1"}', 
         '{"weightmap": "w2", "user": "abcd", "domains": "d2"}'))                                                                                                                                                                       

R> dt[,fromJSON(json),.(id)]                                                                                                                                                                                                                                                                                                  
   id user weightmap domains
1:  1 xyz2        w1      d1
2:  2   w2      abcd      d2

From what I understand, jsonlite picks the keys from the first row and expects the json in all rows to be in the same order.

I also tried this with different datatypes. Eg. If weightmap was a complex datastructure like a list or another nested json, it would fail saying the columns types are not consistent.

Is this expected behavior ? Should'nt it always read from the keys and split to cols ?

Is there a way to make jsonlite parse the json correctly ?

CodePudding user response：

I wonder if this is a "bug" in data.table, where it is internally treating each as a list instead of a data.frame with names that need to be aligned. We can force it this way:

dt[, c(.SD, rbindlist(lapply(json, jsonlite::fromJSON), use.names=TRUE))][, json := NULL][]
#       id   user weightmap domains
#    <int> <char>    <char>  <char>
# 1:     1   xyz2        w1      d1
# 2:     2   abcd        w2      d2

If there is any complexity, what happens depends on the structure. For example, if a weightmap is a list, then

dt <- data.table(id = 1:2, 
    json = c('{"user": "xyz2", "weightmap": ["w1","w2"], "domains": "d1"}', 
             '{"weightmap": "w2", "user": "abcd", "domains": "d2"}'))
dt[, c(.SD, rbindlist(lapply(json, jsonlite::fromJSON), use.names=TRUE))][, json := NULL][]
# Warning in as.data.table.list(jval, .named = NULL) :
#   Item 1 has 2 rows but longest item has 3; recycled with remainder.
# Warning in as.data.table.list(jval, .named = NULL) :
#   Item 2 has 2 rows but longest item has 3; recycled with remainder.
#       id   user weightmap domains
#    <int> <char>    <char>  <char>
# 1:     1   xyz2        w1      d1
# 2:     2   xyz2        w2      d1
# 3:     1   abcd        w2      d2

which achieves an "unnesting" (noisily).

CodePudding user response：

dt[,':='(jsonb=list(as.data.table(fromJSON(json)))),.(id)]
dt[,setcolorder(jsonb[[1]], sort(names(jsonb[[1]]))),.(id)]

The issue was that data table uses an implicit rbindlist over the lists generated for the groups. rbindlist always binds as per position unless use.names is given.

We resolved this issue by setting the colorder explicitly from the first json row so that the order of json fields are maintained.