I am creating a parser that takes data from multiple sources with different schemas and converts it to a single standardized schema.
For example, I have two sources of data:
Source 1:
{
    "students": [{
        "id": 129939,
        "name": "Alice",
        "gender": "female"
    }]
}
Source 2:
{
    "students": [{
        "id": 129939,
        "fullname": "Alice",
        "sex": "female"
    }]
}
Both sources of data can be converted into standardized structured data that I already defined:
    class Student:
        id: int
        name: str
        gender: str
Is there an existing library that lets me define the schema of each input source and then map each input field to the target data structure?
For example, it can be a mapper like this:
    class Input1toStudentMapper:
        id -> Student.id
        name -> Student.name
        gender -> Student.gender

    class Input2toStudentMapper:
        id -> Student.id
        fullname -> Student.name
        sex -> Student.gender
Any suggestions would be appreciated.
CodePudding user response:
I don't know of any library, but for the problem you are describing you can probably code the logic into the __init__ of the class itself:
    class Student:
        # Accepted input keys for each standardized field.
        name_field_variations = ['name', 'fullname']
        sex_field_variations = ['gender', 'sex']

        def __init__(self, **kwargs):
            self.id = kwargs['id']
            # Intersect the incoming keys with the known aliases and use the match.
            _name_field = set(kwargs.keys()) & set(Student.name_field_variations)
            self.name = kwargs[_name_field.pop()]
            _sex_field = set(kwargs.keys()) & set(Student.sex_field_variations)
            self.gender = kwargs[_sex_field.pop()]

    # Sample inputs matching the two sources in the question.
    js1 = {'students': [{'id': 129939, 'name': 'Alice', 'gender': 'female'}]}
    js2 = {'students': [{'id': 129939, 'fullname': 'Alice', 'sex': 'female'}]}

    s1 = Student(**js1['students'][0])
    s2 = Student(**js2['students'][0])
    print(s1.gender)  # female
    print(s2.gender)  # female
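If the key names should live outside the class, closer to the mapper pseudocode in the question, the same idea can be written as one plain dict per source. This is only a minimal stdlib sketch: SOURCE_1_MAP, SOURCE_2_MAP and to_student are hypothetical names made up for illustration, and the Student dataclass here stands in for the class from the question.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Student:
    id: int
    name: str
    gender: str


# One mapping per source: input key -> Student field name.
# (Hypothetical names, not part of any library.)
SOURCE_1_MAP = {"id": "id", "name": "name", "gender": "gender"}
SOURCE_2_MAP = {"id": "id", "fullname": "name", "sex": "gender"}


def to_student(record: Dict[str, Any], field_map: Dict[str, str]) -> Student:
    """Build a Student from one raw record using the given key mapping."""
    missing = [key for key in field_map if key not in record]
    if missing:
        raise KeyError(f"record is missing expected keys: {missing}")
    # Rename each input key to its Student field; unmapped keys are ignored.
    return Student(**{field: record[key] for key, field in field_map.items()})


s1 = to_student({"id": 129939, "name": "Alice", "gender": "female"}, SOURCE_1_MAP)
s2 = to_student({"id": 129939, "fullname": "Alice", "sex": "female"}, SOURCE_2_MAP)
print(s1 == s2)  # True
```

Keeping the mappings as data means a new source only needs a new dict, and a record with an unexpected shape fails with a message naming the missing keys.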
CodePudding user response:
I would check out the dataclass-wizard library for this. It plays well with the built-in dataclasses module in Python. It supports multiple aliases (or key mappings) for each field, as well as a one-way alias, which we might want in this case: for example, if we want to allow an additional mapping of fullname to the name field, but serialize using the default key name.
In the example below I also added a __future__ import, which is supported in Python 3.7 or higher. This is mainly to support the use of forward references; without this future import, we would need to define forward references explicitly, for example List['Student'].
    from __future__ import annotations

    from dataclasses import dataclass
    from typing import List

    from dataclass_wizard import json_key
    from typing_extensions import Annotated


    @dataclass
    class Container:
        students: List[Student]


    @dataclass
    class Student:
        id: int
        name: Annotated[str, json_key('fullname')]
        gender: Annotated[str, json_key('sex')]
In the annotation above, declaring it like json_key('fullname') is shorthand for json_key('name', 'fullname', all=True), which is also how multiple aliases can be mapped to a field.
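As a side note on the forward references mentioned earlier, here is a stdlib-only sketch of what the explicit form looks like when the __future__ import is left out. It uses plain dataclasses without any key aliases, so it only illustrates the annotation syntax and is not meant as a replacement for the example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Container:
    # Student is defined below, so the name is quoted; writing
    # List[Student] here would raise NameError at class creation.
    students: List['Student']


@dataclass
class Student:
    id: int
    name: str
    gender: str


c = Container(students=[Student(id=129939, name='Alice', gender='female')])
print(c.students[0].name)  # Alice
```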
Note that if you only plan to support Python 3.9 or above, you can make the following changes:
- Import Annotated from the typing module instead
- Remove the from typing import List import and define the annotation like list[Student]
Here is sample usage for testing out the above code:
    if __name__ == '__main__':
        from dataclass_wizard import asdict, fromdict, fromlist

        source_1 = {
            "students": [{
                "id": 129939,
                "name": "Alice",
                "gender": "female",
            }]
        }

        source_2 = {
            "students": [{
                "id": 129940,
                "fullname": "Johnny",
                "sex": "male",
            }]
        }

        c1 = fromdict(Container, source_1)
        print(c1)
        # Container(students=[Student(id=129939, name='Alice', gender='female')])

        c2 = fromdict(Container, source_2)
        print(c2)
        # Container(students=[Student(id=129940, name='Johnny', gender='male')])

        # alternatively, if you just need a list of the `Student` instances:
        students = fromlist(Student, source_1['students'])
        print(students)
        # [Student(id=129939, name='Alice', gender='female')]

        # assert we get the same data when serializing the Container instance as a
        # Python dict object.
        serialized_dict = asdict(c1)
        assert serialized_dict == source_1