I have a huge JSON file with duplicate keys in each object, simplified example:
[
{
"a": 3,
"b": "Banana",
"c": 45,
"a": 3,
"a": 8,
}
]
of course, my data has many more keys and objects, but this is a good snippet. and I'd like it to look like this:
|a| b |c |
-------------
|3|Banana|45|
|3|Banana|45|
|8|Banana|45|
I'm not picky, anything on excel, R, python... but none of the json parsers I've seen allow duplicates like this.
I've searched a lot, but I haven't found an answer. Is there any way I can do this and not have to do it manually? The dataset is HUGE. PS I know it's not favorable for json to have multiple duplicate keys. Both the key names and values have duplicates, and I need all of them, but I was given the file this way.
CodePudding user response:
Here's an R solution.
Premise: partially un-jsonify into lists with duplicate names, convert into frames individually, then aggregate into one frame.
I'll augment the data slightly do demonstrate more than one dictionary:
json <- '[
{
"a": 3,
"b": "Banana",
"c": 45,
"a": 3,
"a": 8
},
{
"a": 4,
"b": "Pear",
"c": 46,
"a": 4,
"a": 9
}
]'
Here's the code:
L <- jsonlite::fromJSON(json, simplifyDataFrame=FALSE)
L2 <- lapply(L, function(z) as.data.frame(split(unlist(z, use.names=FALSE), names(z))))
do.call(rbind, L2)
# a b c
# 1 3 Banana 45
# 2 3 Banana 45
# 3 8 Banana 45
# 4 4 Pear 46
# 5 4 Pear 46
# 6 9 Pear 46
CodePudding user response:
Maybe I can help with the duplicate keys issue, they are the main problem, IMO.
In Python, there is a way how to deal with duplicate keys in JSON. You could define an own "hook" that processes key:value pairs.
In your example, the key "a" is present 3 times. Here is a demo that gives all such multiple keys unique names by appending consecutive numbers "_1", "_2", "_3", etc. (If there is a chance of name clash with an existing key like "a_1", change the naming format.)
The result is a valid dict you can process as you like.
import collections
import json
data = """
[
{
"a": 3,
"b": "Banana",
"c": 45,
"a": 3,
"a": 8
}
]
"""
def object_pairs(pairs):
dups = {d:1 for d, i in collections.Counter(pair[0] for pair in pairs).items() if i > 1}
# ^^^ change to d:0 for zero-based counting
dedup = {}
for k, v in pairs:
try:
num = dups[k]
dups[k] = 1
k = f"{k}_{num}"
except KeyError:
pass
dedup[k] = v
return dedup
result = json.loads(data, object_pairs_hook=object_pairs)
print(result) # [{'a_1': 3, 'b': 'Banana', 'c': 45, 'a_2': 3, 'a_3': 8}]