When converting json to columns in R using jsonlite, I came across a bug where the column values were in the wrong columns. I tried replicating it -
R> dt <- data.table(id = 1:2,
json = c('{"user": "xyz2", "weightmap": "w1", "domains": "d1"}',
'{"weightmap": "w2", "user": "abcd", "domains": "d2"}'))
R> dt[,fromJSON(json),.(id)]
id user weightmap domains
1: 1 xyz2 w1 d1
2: 2 w2 abcd d2
From what I understand, jsonlite picks the keys from the first row and expects the json in all rows to be in the same order.
I also tried this with different datatypes. Eg. If weightmap
was a complex datastructure like a list or another nested json, it would fail saying the columns types are not consistent.
Is this expected behavior ? Should'nt it always read from the keys and split to cols ?
Is there a way to make jsonlite parse the json correctly ?
CodePudding user response:
I wonder if this is a "bug" in data.table
, where it is internally treating each as a list
instead of a data.frame
with names that need to be aligned. We can force it this way:
dt[, c(.SD, rbindlist(lapply(json, jsonlite::fromJSON), use.names=TRUE))][, json := NULL][]
# id user weightmap domains
# <int> <char> <char> <char>
# 1: 1 xyz2 w1 d1
# 2: 2 abcd w2 d2
If there is any complexity, what happens depends on the structure. For example, if a weightmap
is a list, then
dt <- data.table(id = 1:2,
json = c('{"user": "xyz2", "weightmap": ["w1","w2"], "domains": "d1"}',
'{"weightmap": "w2", "user": "abcd", "domains": "d2"}'))
dt[, c(.SD, rbindlist(lapply(json, jsonlite::fromJSON), use.names=TRUE))][, json := NULL][]
# Warning in as.data.table.list(jval, .named = NULL) :
# Item 1 has 2 rows but longest item has 3; recycled with remainder.
# Warning in as.data.table.list(jval, .named = NULL) :
# Item 2 has 2 rows but longest item has 3; recycled with remainder.
# id user weightmap domains
# <int> <char> <char> <char>
# 1: 1 xyz2 w1 d1
# 2: 2 xyz2 w2 d1
# 3: 1 abcd w2 d2
which achieves an "unnesting" (noisily).
CodePudding user response:
dt[,':='(jsonb=list(as.data.table(fromJSON(json)))),.(id)]
dt[,setcolorder(jsonb[[1]], sort(names(jsonb[[1]]))),.(id)]
The issue was that data table uses an implicit rbindlist over the lists generated for the groups. rbindlist always binds as per position unless use.names
is given.
We resolved this issue by setting the colorder explicitly from the first json row so that the order of json fields are maintained.