Home > database >  JSON file with duplicate keys to a dataframe or excel file
JSON file with duplicate keys to a dataframe or excel file

Time:12-05

I have a huge JSON file with duplicate keys in each object, simplified example:

[ 
  {
   "a": 3,
   "b": "Banana",
   "c": 45,
   "a": 3,
   "a": 8,
   }
]

of course, my data has many more keys and objects, but this is a good snippet. and I'd like it to look like this:

|a|   b  |c |
-------------
|3|Banana|45|
|3|Banana|45|
|8|Banana|45|

I'm not picky, anything on excel, R, python... but none of the json parsers I've seen allow duplicates like this.

I've searched a lot, but I haven't found an answer. Is there any way I can do this and not have to do it manually? The dataset is HUGE. PS I know it's not favorable for json to have multiple duplicate keys. Both the key names and values have duplicates, and I need all of them, but I was given the file this way.

CodePudding user response:

Here's an R solution.

Premise: partially un-jsonify into lists with duplicate names, convert into frames individually, then aggregate into one frame.

I'll augment the data slightly do demonstrate more than one dictionary:

json <- '[ 
  {
   "a": 3,
   "b": "Banana",
   "c": 45,
   "a": 3,
   "a": 8
   }, 
   {
   "a": 4,
   "b": "Pear",
   "c": 46,
   "a": 4,
   "a": 9
   }
 ]'

Here's the code:

L <- jsonlite::fromJSON(json, simplifyDataFrame=FALSE)
L2 <- lapply(L, function(z) as.data.frame(split(unlist(z, use.names=FALSE), names(z))))
do.call(rbind, L2)
#   a      b  c
# 1 3 Banana 45
# 2 3 Banana 45
# 3 8 Banana 45
# 4 4   Pear 46
# 5 4   Pear 46
# 6 9   Pear 46

CodePudding user response:

Maybe I can help with the duplicate keys issue, they are the main problem, IMO.

In Python, there is a way how to deal with duplicate keys in JSON. You could define an own "hook" that processes key:value pairs.

In your example, the key "a" is present 3 times. Here is a demo that gives all such multiple keys unique names by appending consecutive numbers "_1", "_2", "_3", etc. (If there is a chance of name clash with an existing key like "a_1", change the naming format.)

The result is a valid dict you can process as you like.

import collections
import json

data = """ 
[ 
  {
   "a": 3,
   "b": "Banana",
   "c": 45,
   "a": 3,
   "a": 8
   }
]
"""

def object_pairs(pairs):
    dups = {d:1 for d, i in collections.Counter(pair[0] for pair in pairs).items() if i > 1}
    #       ^^^ change to d:0 for zero-based counting
    dedup = {}
    for k, v in pairs:
        try:
            num = dups[k]
            dups[k]  = 1
            k = f"{k}_{num}"
        except KeyError:
            pass
        dedup[k] = v 
    return dedup

result = json.loads(data, object_pairs_hook=object_pairs)
print(result) # [{'a_1': 3, 'b': 'Banana', 'c': 45, 'a_2': 3, 'a_3': 8}]
  • Related