How to convert multiple data sources into a predefined data structure in Python?

I am creating a parser that takes data from multiple sources, each with its own schema, and converts it to a single standardized schema.

For example, I have 2 sources of data:

Source 1:

{
  "students": [{
     "id": 129939,
     "name": "Alice",
     "gender": "female",
  }]
}

Source 2:

{
  "students": [{
     "id": 129939,
     "fullname": "Alice",
     "sex": "female",
  }]
}

Both sources of data can be converted into standardized structured data that I already defined:

class Student:
   id: int
   name: str
   gender: str

Is there an existing library that lets me define the schema of each input source and then map each field of that source to the target data structure?

For example, it can be a mapper like this:

class Input1toStudentMapper:
   id -> Student.id
   name -> Student.name
   gender -> Student.gender

class Input2toStudentMapper:
   id -> Student.id
   fullname -> Student.name
   sex -> Student.gender

Any suggestions would be appreciated.

CodePudding user response：

I don't know of any library, but for the problem you are describing you can probably code the logic into the __init__ of the class itself:

class Student:
    name_field_variations = ['name', 'fullname']
    sex_field_variations = ['gender', 'sex']
    def __init__(self, **kwargs):
        self.id = kwargs['id']
        _name_field = set(kwargs.keys()) & set(Student.name_field_variations)
        self.name = kwargs[_name_field.pop()]
        _sex_field = set(kwargs.keys()) & set(Student.sex_field_variations)
        self.gender = kwargs[_sex_field.pop()]

# sample records from the two sources
js1 = {'students': [{'id': 129939, 'name': 'Alice', 'gender': 'female'}]}
js2 = {'students': [{'id': 129939, 'fullname': 'Alice', 'sex': 'female'}]}
s1 = Student(**js1['students'][0])
s2 = Student(**js2['students'][0])
print(s1.gender) # female
print(s2.gender) # female
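
If you would rather keep one mapping per source, closer to the mapper classes sketched in the question, something like the following might work using only the standard library. This is a minimal sketch, not an existing library: the names INPUT1_TO_STUDENT, INPUT2_TO_STUDENT and map_record are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Student:
    id: int
    name: str
    gender: str


# Hypothetical per-source mappings: source key -> Student field name.
INPUT1_TO_STUDENT = {'id': 'id', 'name': 'name', 'gender': 'gender'}
INPUT2_TO_STUDENT = {'id': 'id', 'fullname': 'name', 'sex': 'gender'}


def map_record(record, mapping):
    """Build a Student from one source record using the given mapping."""
    missing = [key for key in mapping if key not in record]
    if missing:
        # Fail with a readable message instead of an opaque lookup error.
        raise KeyError(f'record is missing expected keys: {missing}')
    return Student(**{field: record[key] for key, field in mapping.items()})


record_1 = {'id': 129939, 'name': 'Alice', 'gender': 'female'}
record_2 = {'id': 129939, 'fullname': 'Alice', 'sex': 'female'}

s1 = map_record(record_1, INPUT1_TO_STUDENT)
s2 = map_record(record_2, INPUT2_TO_STUDENT)
print(s1)        # Student(id=129939, name='Alice', gender='female')
print(s1 == s2)  # True
```

Compared with putting the variations inside __init__, this keeps Student unaware of any source format, so adding a third source only means adding another mapping dict.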

CodePudding user response：

I would check out the dataclass-wizard library for this. It plays well with Python's built-in dataclasses module. It supports multiple aliases (key mappings) for each field, as well as a one-way alias, which is what we want in this case: for example, accepting fullname as an additional key for the name field while still serializing with the default key, name.

In the example below I also added a __future__ import, which is supported in Python 3.7 or higher. It is mainly there to support forward references; without it, we would need to declare them explicitly, e.g. List['Student'].

from __future__ import annotations

from dataclasses import dataclass
from typing import List

from dataclass_wizard import json_key
from typing_extensions import Annotated


@dataclass
class Container:
    students: List[Student]


@dataclass
class Student:
    id: int
    name: Annotated[str, json_key('fullname')]
    gender: Annotated[str, json_key('sex')]

In the annotation above, json_key('fullname') is shorthand for json_key('name', 'fullname', all=True), which is also how multiple aliases can be mapped to a field.

Note that if you only need to support Python 3.9 or above, you can make the following changes:

Import Annotated from the typing module instead of typing_extensions.
Remove the from typing import List import and write the annotation as list[Student].

Here is sample usage for testing out the above code:

if __name__ == '__main__':
    from dataclass_wizard import asdict, fromdict, fromlist

    source_1 = {
        "students": [{
            "id": 129939,
            "name": "Alice",
            "gender": "female",
        }]
    }

    source_2 = {
        "students": [{
            "id": 129940,
            "fullname": "Johnny",
            "sex": "male",
        }]
    }

    c1 = fromdict(Container, source_1)
    print(c1)
    # Container(students=[Student(id=129939, name='Alice', gender='female')])

    c2 = fromdict(Container, source_2)
    print(c2)
    # Container(students=[Student(id=129940, name='Johnny', gender='male')])

    # alternatively, if you just need a list of the `Student` instances:
    students = fromlist(Student, source_1['students'])
    print(students)
    # [Student(id=129939, name='Alice', gender='female')]

    # assert we get the same data when serializing the Container instance as a
    # Python dict object.
    serialized_dict = asdict(c1)
    assert serialized_dict == source_1
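
To illustrate what a one-way alias means, here is a rough standard-library sketch of the idea: extra keys are accepted when loading, but serialization always uses the field name. This is not how dataclass-wizard is implemented; the 'load_aliases' metadata key and the from_dict helper are made up for this example.

```python
from dataclasses import asdict, dataclass, field, fields


@dataclass
class Student:
    id: int
    # 'load_aliases' is a hypothetical metadata key read only by from_dict below.
    name: str = field(metadata={'load_aliases': ('fullname',)})
    gender: str = field(metadata={'load_aliases': ('sex',)})


def from_dict(cls, data):
    """Load a dataclass, accepting each field's name or any of its load aliases."""
    kwargs = {}
    for f in fields(cls):
        # The field's own name is tried first, then its aliases in order.
        candidates = (f.name, *f.metadata.get('load_aliases', ()))
        for key in candidates:
            if key in data:
                kwargs[f.name] = data[key]
                break
    return cls(**kwargs)


student = from_dict(Student, {'id': 129940, 'fullname': 'Johnny', 'sex': 'male'})
print(student)
# Student(id=129940, name='Johnny', gender='male')

# Serialization ignores the aliases and uses the field names.
print(asdict(student))
# {'id': 129940, 'name': 'Johnny', 'gender': 'male'}
```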