I would like to add a JSON column to an existing data.frame that contains most each row's information. The code below displays my current approach and achieves my desired result.
# Establish test data
testData <- data.table::data.table(id = 1:10000,
var1 = rep(letters[1:10], times = 1000),
var2 = rep(letters[11:20], times = 1000))
# Establish which variables will be JSON-ed
jsonVars <- c("var1","var2")
# Initialize JSON column
testData[["json"]] <- as.character(NA)
# Loop through and populate JSON column
for(i in 1:nrow(testData)) {
cat(paste0("Running ", i, " of ", nrow(testData), "\n"))
testData[i,][["json"]] <- gsub('^.|.$','',jsonlite::toJSON(testData[i,jsonVars,with=F]))
}
# Keep only the identifier and JSON fields
testData <- testData[,c("id","json"),with=F]
As you can see, the result I want is just an "id" field and "json" field that contains all of the row's information.
Although the approach above works, it is too slow because my "real" dataset has millions of records and ~300 columns.
I've tried to use apply()
(see below), but the performance doesn't improve
# Establish test data
testData <- data.table::data.table(id = 1:10000,
var1 = rep(letters[1:10], times = 1000),
var2 = rep(letters[11:20], times = 1000))
# Establish which variables will be JSON-ed
jsonVars <- c("var1","var2")
# Initialize JSON column
testData[["json"]] <- as.character(NA)
# Use apply to populate JSON column
jsonFUN <- function(x) { gsub('^.|.$','',jsonlite::toJSON(testData[,jsonVars,with=F])) }
testData$json <- apply(X = testData, MARGIN = 1, FUN = jsonFUN)
# Keep only the identifier and JSON fields
testData <- testData[,c("id","json"),with=F]
Does anyone know an efficient method to accomplish this task more quickly?
Perhaps the answer is better use of data.table
tools. The documentation for jsonlite::toJSON
also mentions processing for vectors, data.frames, and arrays, but I haven't been able to find a solution.
Your assistance is appreciated!
CodePudding user response:
jsonlite
This isn't really fast, but it works and is a canonical json method:
testData[, .(json = jsonlite::toJSON(.SD)), by = id
][, json := gsub("^\\[|\\]$", "", json)][]
# id json
# <int> <json>
# 1: 1 {"var1":"a","var2":"k"}
# 2: 2 {"var1":"b","var2":"l"}
# 3: 3 {"var1":"c","var2":"m"}
# 4: 4 {"var1":"d","var2":"n"}
# 5: 5 {"var1":"e","var2":"o"}
# 6: 6 {"var1":"f","var2":"p"}
# 7: 7 {"var1":"g","var2":"q"}
# 8: 8 {"var1":"h","var2":"r"}
# 9: 9 {"var1":"i","var2":"s"}
# 10: 10 {"var1":"j","var2":"t"}
# ---
# 9991: 9991 {"var1":"a","var2":"k"}
# 9992: 9992 {"var1":"b","var2":"l"}
# 9993: 9993 {"var1":"c","var2":"m"}
# 9994: 9994 {"var1":"d","var2":"n"}
# 9995: 9995 {"var1":"e","var2":"o"}
# 9996: 9996 {"var1":"f","var2":"p"}
# 9997: 9997 {"var1":"g","var2":"q"}
# 9998: 9998 {"var1":"h","var2":"r"}
# 9999: 9999 {"var1":"i","var2":"s"}
# 10000: 10000 {"var1":"j","var2":"t"}
The default output for each row is actually "[{"var1":"a","var2":"k"}]"
(a list of length 1), I removed that with the gsub
pattern. If you don't mind the length-1 list then you can simplify out the gsub
.
Also, this assumes that rows are unique by id
; if you have multiple rows for an id
then this will produce different results (for those rows).
sprintf
This is a hack and not nearly as general, but it's likely much faster:
testData[, .(id, json = sprintf('{"var1":%s,"var2":%s}', dQuote(var1, FALSE), dQuote(var2, FALSE)))]
I caution against using this for anything more complex, as using jsonlite::toJSON
is the more cautious/correct method for anything non-trivial.