Home > Software engineering >  R JSON: Quickly create a json field as a new variable
R JSON: Quickly create a json field as a new variable

Time:12-16

I would like to add a JSON column to an existing data.frame that contains most each row's information. The code below displays my current approach and achieves my desired result.

# Establish test data
testData <- data.table::data.table(id = 1:10000,
                                   var1 = rep(letters[1:10], times = 1000),
                                   var2 = rep(letters[11:20], times = 1000))

# Establish which variables will be JSON-ed
jsonVars <- c("var1","var2")

# Initialize JSON column
testData[["json"]] <- as.character(NA)

# Loop through and populate JSON column
for(i in 1:nrow(testData)) {
  cat(paste0("Running ", i, " of ", nrow(testData), "\n"))
  testData[i,][["json"]] <- gsub('^.|.$','',jsonlite::toJSON(testData[i,jsonVars,with=F]))
}

# Keep only the identifier and JSON fields
testData <- testData[,c("id","json"),with=F]

As you can see, the result I want is just an "id" field and "json" field that contains all of the row's information.

enter image description here

Although the approach above works, it is too slow because my "real" dataset has millions of records and ~300 columns.

I've tried to use apply() (see below), but the performance doesn't improve

# Establish test data
testData <- data.table::data.table(id = 1:10000,
                                   var1 = rep(letters[1:10], times = 1000),
                                   var2 = rep(letters[11:20], times = 1000))

# Establish which variables will be JSON-ed
jsonVars <- c("var1","var2")

# Initialize JSON column
testData[["json"]] <- as.character(NA)

# Use apply to populate JSON column
jsonFUN <- function(x) { gsub('^.|.$','',jsonlite::toJSON(testData[,jsonVars,with=F])) }
testData$json <- apply(X = testData, MARGIN = 1, FUN = jsonFUN)

# Keep only the identifier and JSON fields
testData <- testData[,c("id","json"),with=F]

Does anyone know an efficient method to accomplish this task more quickly?

Perhaps the answer is better use of data.table tools. The documentation for jsonlite::toJSON also mentions processing for vectors, data.frames, and arrays, but I haven't been able to find a solution.

Your assistance is appreciated!

CodePudding user response:

jsonlite

This isn't really fast, but it works and is a canonical json method:

testData[, .(json = jsonlite::toJSON(.SD)), by = id
  ][, json := gsub("^\\[|\\]$", "", json)][]
#           id                    json
#        <int>                  <json>
#     1:     1 {"var1":"a","var2":"k"}
#     2:     2 {"var1":"b","var2":"l"}
#     3:     3 {"var1":"c","var2":"m"}
#     4:     4 {"var1":"d","var2":"n"}
#     5:     5 {"var1":"e","var2":"o"}
#     6:     6 {"var1":"f","var2":"p"}
#     7:     7 {"var1":"g","var2":"q"}
#     8:     8 {"var1":"h","var2":"r"}
#     9:     9 {"var1":"i","var2":"s"}
#    10:    10 {"var1":"j","var2":"t"}
#    ---                              
#  9991:  9991 {"var1":"a","var2":"k"}
#  9992:  9992 {"var1":"b","var2":"l"}
#  9993:  9993 {"var1":"c","var2":"m"}
#  9994:  9994 {"var1":"d","var2":"n"}
#  9995:  9995 {"var1":"e","var2":"o"}
#  9996:  9996 {"var1":"f","var2":"p"}
#  9997:  9997 {"var1":"g","var2":"q"}
#  9998:  9998 {"var1":"h","var2":"r"}
#  9999:  9999 {"var1":"i","var2":"s"}
# 10000: 10000 {"var1":"j","var2":"t"}

The default output for each row is actually "[{"var1":"a","var2":"k"}]" (a list of length 1), I removed that with the gsub pattern. If you don't mind the length-1 list then you can simplify out the gsub.

Also, this assumes that rows are unique by id; if you have multiple rows for an id then this will produce different results (for those rows).

sprintf

This is a hack and not nearly as general, but it's likely much faster:

testData[, .(id, json = sprintf('{"var1":%s,"var2":%s}', dQuote(var1, FALSE), dQuote(var2, FALSE)))]

I caution against using this for anything more complex, as using jsonlite::toJSON is the more cautious/correct method for anything non-trivial.

  • Related