Home > OS >  Convert a CSV into a JSON using a JSON Schema
Convert a CSV into a JSON using a JSON Schema

Time:11-11

How do I convert a flat table into a JSON?

I have previously converted JSONs into Flat Tables using both custom code and libraries. However, what I am aiming to do here is the reverse. Before going ahead and creating a custom library, I was wondering if anyone had encountered this problem before and if there was an existing solution to it.

When you flatten a JSON into a CSV, you loose the information on the structure, and therefore to reverse it, you need a document that describes how the JSON should be built, which ideally would be the standardised JSON Schema.

The following example shows a source CSV, the JSON Schema and the expected output.

User CSV

user_id, adress.city, address.street, address.number, name, aka, contacts.name, contacts.relationship
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Beth, Daughter

JSON Schema

This follows the defined standard with the addition of the "source" property. I am suggesting adding this custom property to this specific problem in order to map between the csv columns and the JSON values (leafs).

{
 "$schema": "https://json-schema.org/draft/2020-12/schema",
 "title": "User",
 "type": "object",
 "properties":{
  "user_id" : {"type":"integer", "source":"user_id"},
  "address":{
   "type":"object",
   "properties":{
    "city" : {"type":"string", "source":"adress.city"},
    "street" : {"type":"string", "source":"adress.street"},
    "number": { "type":"integer", "source":"adress.number"}
   }
  },
  "name" : {"type":"string", "source":"name"}},
  "aka":{
   "type": "array",
   "items" : {"type":"string", "source":"aka"}
  },
  "contacts":{
   "type":"array",
   "items":{
    "type":"object",
    "properties":{
     "name" : {"type":"string", "source":"contacts.name"},
     "relationship":{"type":"string", "source":"contacts.relationship"}
    },
   }
  }
 }
}

Expected JSON

{
 "user_id":1,
 "address":{
  "city":"Seattle",
  "street":"Atomic Street",
  "number":6910
 },
 "name":"Rick Sanchez",
 "aka":[
  "Rick",
  "Grandpa",
  "Albert Ein-douche",
  "Richard"
 ],
 "contacts":[
  {
   "name":"Morty",
   "relationship":"Grandson"
  },
  {
   "name":"Beth",
   "relationship":"Daughter"
  }
 ]
}

From the above we see that although there are 8 rows in the CSV, we are producing a single JSON Object (instead of 8) since there is only one unique user (user_id = 1). This could be inferred from the JSON Schema where the root element is an object and not a list.

If we did not specify a JSON Schema or other kind of mapping, you could simply assume no structure and just create 8 flat jsons as below

[
 {"user_id":1,"address.city":"Seattle", ... "aka":"Rick" ... "contacts.relationship":"Grandson"}
 ...
 {"user_id":1,"address.city":"Seattle", ... "aka":"Richard" ... "contacts.relationship":"Daughter"}
]

I am adding the Python tag since that is the language I use mostly, but in this case, the solution doesn't need to be in Python.

CodePudding user response:

I'm not entirely clear on why JSON schema would be needed for this, but if you wanted to, you could easily create a convenience function which can essentially "unflatten" the flat JSON that your CSV data would be mapped to, into a nested dictionary format as mentioned above.

The following example should demonstrate a simplified example of how this would work. Note the following two points:

  • In the CSV header, I've corrected a typo and renamed one of the columns to address.city; previously, it was adress.city, which would result in it getting mapped to another JSON path under a separate adress key, which might not be desirable.

  • I wasn't sure of the best way to handle this, but it looks like csv module only allows a single-character delimiter; in the CSV file, it looks like you have a comma and a space , as the separator, so I've just replaced all occurrences of this with a single comma , so that the split on the delimiter works as expected.

from csv import DictReader
from io import StringIO
from typing import Any


csv_data = StringIO("""\
user_id, address.city, address.street, address.number, name, aka, contacts.name, contacts.relationship
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Beth, Daughter
""".replace(', ', ',')
)


def unflatten_json(json_dict: dict):
    """Unflatten a JSON dictionary object, with keys like 'a.b.c'"""
    result_dict = {}

    for k, v in json_dict.items():
        *nested_parts, field_name = k.split('.')

        obj = result_dict
        for p in nested_parts:
            obj = obj.setdefault(p, {})

        obj[field_name] = v

    return result_dict


def main():
    reader = DictReader(csv_data)
    flat_json: list[dict[str, Any]] = list(reader)

    first_obj = flat_json[0]
    nested_dict = unflatten_json(first_obj)

    print('Flat JSON:   ', first_obj)
    print('Nested JSON: ', nested_dict)


if __name__ == '__main__':
    main()

The output is given below:

Flat JSON:    {'user_id': '1', 'address.city': 'Seattle', 'address.street': 'Atomic Street', 'address.number': '6910', 'name': 'Rick Sanchez', 'aka': 'Rick', 'contacts.name': 'Morty', 'contacts.relationship': 'Grandson'}
Nested JSON:  {'user_id': '1', 'address': {'city': 'Seattle', 'street': 'Atomic Street', 'number': '6910'}, 'name': 'Rick Sanchez', 'aka': 'Rick', 'contacts': {'name': 'Morty', 'relationship': 'Grandson'}}

Note, if you want to unflatten all JSON dictionary objects in the list, you could instead use a list comprehension as below:

result_list = [unflatten_json(d) for d in flat_json]

I would also point out that the above solution isn't perfect, as it will pass in everything as string values, for example in the case of 'user_id': '1'. To work around that, you can modify the unflatten_json function so it is like below:

...
for k, v in json_dict.items():
    ...

    try:
        v = int(v)
    except ValueError:
        pass

    obj[field_name] = v

Now the unflattened JSON object should be as below. Note that I'm pretty printing it with json.dumps(nested_dict, indent=2) so it's a little easier to see.

{
  "user_id": 1,
  "address": {
    "city": "Seattle",
    "street": "Atomic Street",
    "number": 6910
  },
  "name": "Rick Sanchez",
  "aka": "Rick",
  "contacts": {
    "name": "Morty",
    "relationship": "Grandson"
  }
}

Complete Solution

The full solution to achieve the desired output (data for all rows appended to aka and contacts) is provided below:

from csv import DictReader
from io import StringIO
from pprint import pprint


csv_data = StringIO("""\
user_id, address.city, address.street, address.number, name, aka, contacts.name, contacts.relationship
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Beth, Daughter
""".replace(', ', ',')
)


def unflatten_json(json_dict: dict[str, str]):
    """Unflatten a JSON dictionary object, with keys like 'a.b.c'"""
    result_dict = {}

    for k, v in json_dict.items():
        *nested_parts, field_name = k.split('.')

        obj = result_dict
        for p in nested_parts:
            obj = obj.setdefault(p, {})

        obj[field_name] = int(v) if v.isnumeric() else v

    return result_dict


def main():
    reader = DictReader(csv_data)

    rows = list(map(unflatten_json, reader))

    # retrieve the first element in the (unflattened) sequence
    result_obj = rows[0]
    # define list fields that we want to merge data for
    list_fields = ('aka', 'contacts')
    # now loop through, and for all rows merge the data for these fields
    for field in list_fields:
        result_obj[field] = [row[field] for row in rows]

    print('Result object:')
    pprint(result_obj)


if __name__ == '__main__':
    main()

This should have the desired result as also noted in the question:

Result object:
{'address': {'city': 'Seattle', 'number': 6910, 'street': 'Atomic Street'},
 'aka': ['Rick',
         'Grandpa',
         'Albert Ein-douche',
         'Richard',
         'Rick',
         'Grandpa',
         'Albert Ein-douche',
         'Richard'],
 'contacts': [{'name': 'Morty', 'relationship': 'Grandson'},
              {'name': 'Morty', 'relationship': 'Grandson'},
              {'name': 'Morty', 'relationship': 'Grandson'},
              {'name': 'Morty', 'relationship': 'Grandson'},
              {'name': 'Beth', 'relationship': 'Daughter'},
              {'name': 'Beth', 'relationship': 'Daughter'},
              {'name': 'Beth', 'relationship': 'Daughter'},
              {'name': 'Beth', 'relationship': 'Daughter'}],
 'name': 'Rick Sanchez',
 'user_id': 1}

CodePudding user response:

Like already mentionned, JSON schema can't help you to transform your data. It can help you validate the result though.
In order to get one entry per user I think you should group your DataFrame by ["user_id", "address.city", "address.street", "address.number", "name"]. Those values should be constant for a user.
Then aggregate the remaining columns to create lists.

I created generic functions to unflatten the dictionaries and merge the list in dictionaries. You could get rid of the recursion since in your case everything is done on top level:

import json
import pandas as pd

df = pd.read_csv("file.csv", sep=", ", engine="python")
df = df.groupby(["user_id", "address.city", "address.street", "address.number", "name"], as_index=False).agg(lambda x: list(x))

#print(df) # uncomment to see the transformation

json_data = df.to_dict(orient="records")

def unflatten_dic(dic):
    for k,v in list(dic.items()):
        subkeys = k.split('.')
        if len(subkeys) > 1:
            dic.setdefault(subkeys[0],dict())
            dic[subkeys[0]].update({"".join(subkeys[1:]): v})
            unflatten_dic(dic[subkeys[0]])
            del(dic[k])


def merge_lists(dic):
    for k,v in list(dic.items()):
        if isinstance(v, dict):
            keys = list(v.keys())
            vals = list(v.values())
            if all(isinstance(l, list) and len(l)==len(vals[0]) for l in vals):
                dic[k] = []
                val_tuple = set(zip(*vals)) # removing duplicates with set()
                for t in val_tuple:
                    dic[k].append({subkey: t[i] for i, subkey in enumerate(keys)})
            else:
                merge_lists(v)
        elif isinstance(v, list):
            dic[k] = list(set(v))   # removing list duplicates
                    

for user in json_data:
    unflatten_dic(user)
    merge_lists(user)

print(json.dumps(json_data, indent=4))

Output:

[
    {
        "user_id": 1,
        "name": "Rick Sanchez",
        "aka": [
            "Richard",
            "Grandpa",
            "Albert Ein-douche",
            "Rick"
        ],
        "address": {
            "city": "Seattle",
            "street": "Atomic Street",
            "number": 6910
        },
        "contacts": [
            {
                "name": "Morty",
                "relationship": "Grandson"
            },
            {
                "name": "Beth",
                "relationship": "Daughter"
            }
        ]
    }
]
  • Related