I need to consume JSON from a 3rd party API, i.e. I have to deal with whatever this API returns and can't change that.
For this specific task the API returns what it calls an "entity". Yeah, not very meaningful. The issue is that the structure is deeply nested, and in my parsing I want to be able to flatten it to some degree. To explain, here is an obfuscated example of a single "entity". In the full response this sits in an array named "data" which can contain multiple entities.
{
    "type": "entity",
    "id": "efebcc3e-445c-4d85-9689-bb85f46160cb",
    "links": {
        "self": "https://example.com/api/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb"
    },
    "attributes": {
        "id": "efebcc3e-445c-4d85-9689-bb85f46160cb",
        "eid": "efebcc3e-445c-4d85-9689-bb85f46160cb",
        "name": "E03075-042",
        "description": "",
        "createdAt": "2021-07-14T05:58:47.239Z",
        "editedAt": "2022-09-22T11:28:53.327Z",
        "state": "open",
        "fields": {
            "Department": {
                "value": "Foo"
            },
            "Description": {
                "value": ""
            },
            "Division": {
                "value": "Bar"
            },
            "Name": {
                "value": "E03075-042"
            },
            "Project": {
                "details": {
                    "description": ""
                },
                "value": "My Project"
            }
        }
    },
    "relationships": {
        "createdBy": {
            "links": {
                "self": "https://example.com/api/rest/v1.0/users/101"
            },
            "data": {
                "type": "user",
                "id": "101"
            }
        },
        "editedBy": {
            "links": {
                "self": "https://example.com/api/rest/v1.0/users/101"
            },
            "data": {
                "type": "user",
                "id": "101"
            }
        },
        "ancestors": {
            "links": {
                "self": "https://example.com/api/rest/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb/ancestors"
            },
            "data": [
                {
                    "type": "entity",
                    "id": "7h60bcb9-b1c0-4a12-8b6b-12e3eab54e6f",
                    "meta": {
                        "links": {
                            "self": "https://example.com/api/rest/v1.0/entities/7h60bcb9-b1c0-4a12-8b6b-12e3eab54e6f"
                        }
                    }
                }
            ]
        },
        "owner": {
            "links": {
                "self": "https://example.com/api/rest/v1.0/users/101"
            },
            "data": {
                "type": "user",
                "id": "101"
            }
        },
        "pdf": {
            "links": {
                "self": "https://example.com/api/rest/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb/pdf"
            }
        }
    }
}
I want to parse this into a data container. I'm open to custom parsing and just using a data class instead of Pydantic if what I want is not possible with it.
Issues with the data:
- "links": uses "self" as a field name in the JSON. I would like to unnest this and have a top-level field named simply "link".
- "attributes": unnest as well, so they are not wrapped inside an Attributes model.
- "fields": unnest to top level and remove/ignore the duplication ("name", "description").
- "Project" in "fields": unnest to top level and only use the "value" field.
- "relationships": unnest, ignore some, and maybe even resolve to an actual user name.
Can I control Pydantic in such a way that it unnests the data as I prefer and ignores unmapped fields?
Can the parsing also include resolving, which means more API calls?
CodePudding user response:
Pydantic provides root validators to perform validation on the entire model's data. In this case, however, I am not sure it is a good idea to do everything in one giant validation function.
I would probably go with a two-stage parsing setup. The first model should capture the "raw" data more or less in the schema you expect from the API. The second model should then reflect your own desired data schema.
That way, if you encounter an error, you can easily pinpoint whether it came from an unexpected data format returned by the API or from a bug in your flattening/parsing process further down the line.
Following is an example.
Start off by defining base classes for inheritance/less duplication later on:
from __future__ import annotations

from datetime import datetime
from enum import Enum
from typing import Any

from pydantic import AnyHttpUrl, BaseModel, Field, root_validator, validator
from pydantic.fields import ModelField, SHAPE_LIST


class StateEnum(Enum):
    open = "open"
    something_else = "something_else"


class BaseAttributes(BaseModel):
    id: str
    eid: str
    created_at: datetime = Field(alias="createdAt")
    edited_at: datetime = Field(alias="editedAt")
    state: StateEnum

    # some fields:
    name: str = Field(alias="Name")
    description: str = Field(alias="Description")

    class Config:
        allow_population_by_field_name = True


class RawRelationship(BaseModel):
    links: dict[str, AnyHttpUrl]
    data: dict[str, Any] | list[dict[str, Any]] | None = None


class BaseEntity(BaseModel):
    type: str
    id: str
# More code below...
The state field just screamed "choices" at me, so I went with an enum, just as an idea. I also chose to use Pythonic naming conventions alongside the actual data key names as aliases.
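As a standalone illustration of that idea, here is a minimal throwaway model (unrelated to the entity schema) showing that an enum-typed field makes Pydantic reject any value outside the allowed set:

```python
from enum import Enum

from pydantic import BaseModel, ValidationError


class StateEnum(Enum):
    open = "open"
    something_else = "something_else"


class Demo(BaseModel):
    state: StateEnum


print(Demo(state="open").state)  # StateEnum.open

try:
    Demo(state="closed")
except ValidationError:
    print("rejected")  # "closed" is not a valid StateEnum value
```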
Now you can define your RawEntity model to capture the raw API output:
...
class RawAttributes(BaseAttributes):
    fields: dict[str, Any]


class RawEntity(BaseEntity):
    links: dict[str, AnyHttpUrl]
    attributes: RawAttributes
    relationships: dict[str, RawRelationship] = {}

    @root_validator
    def ensure_consistency(cls, values: dict[str, Any]) -> dict[str, Any]:
        if values["id"] != values["attributes"].id:
            raise ValueError("id inconsistent")
        return values
# More code below...
There you have a demo of how root validation can make sense.
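To see the failure mode in isolation, here is a small toy model (field names made up for the demo; I use the pre=True variant of the root validator here) that raises on inconsistent ids:

```python
from typing import Any

from pydantic import BaseModel, ValidationError, root_validator


class Toy(BaseModel):
    id: str
    inner_id: str

    @root_validator(pre=True)
    def ensure_consistency(cls, values: dict[str, Any]) -> dict[str, Any]:
        # Runs across all incoming values, so fields can be cross-checked.
        if values.get("id") != values.get("inner_id"):
            raise ValueError("id inconsistent")
        return values


print(Toy(id="a", inner_id="a"))

try:
    Toy(id="a", inner_id="b")
except ValidationError:
    print("caught inconsistent ids")
```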
Finally, we can write our target model. We can give it a class method specifically for parsing a RawEntity into a FlatEntity, which performs a few of the flattening tasks. The field-specific ones we can delegate to validators again:
...
SELF_KEY = "self"


class FlatRelationship(BaseEntity):
    link: AnyHttpUrl


class FlatEntity(BaseAttributes, BaseEntity):
    link: AnyHttpUrl

    # more fields:
    department: str = Field(alias="Department")
    division: str = Field(alias="Division")
    project: str = Field(alias="Project")

    # relationships:
    created_by: FlatRelationship = Field(alias="createdBy")
    edited_by: FlatRelationship = Field(alias="editedBy")
    ancestors: list[FlatRelationship]
    owner: FlatRelationship
    pdf: AnyHttpUrl

    @classmethod
    def from_raw_entity(cls, entity: RawEntity) -> FlatEntity:
        data = entity.dict(exclude={"links", "attributes", "relationships"})
        data["link"] = entity.links[SELF_KEY]
        data |= entity.attributes.dict(exclude={"fields"})
        data |= entity.attributes.fields
        data |= entity.relationships
        return cls.parse_obj(data)

    @validator(
        "name",
        "description",
        "department",
        "division",
        "project",
        pre=True,
    )
    def get_field_value(cls, value: object) -> object:
        if isinstance(value, dict):
            return value["value"]
        return value

    @validator("*", pre=True)
    def flatten_relationship(cls, value: object, field: ModelField) -> object:
        if field.type_ is not FlatRelationship:
            return value
        if not isinstance(value, RawRelationship):
            return value
        if isinstance(value.data, dict):
            return FlatRelationship(**value.data, link=value.links[SELF_KEY])
        if isinstance(value.data, list) and field.shape == SHAPE_LIST:
            return [
                FlatRelationship(**data, link=value.links[SELF_KEY])
                for data in value.data
            ]
        return value

    @validator("pdf", pre=True)
    def get_pdf_link(cls, value: object) -> object:
        if isinstance(value, RawRelationship):
            return value.links[SELF_KEY]
        return value
# More code below...
As you can see, there is one validator for all the fields that are grouped under "fields" in the source data. There is another one for flattening the "relationships", which basically turns them into instances of BaseEntity, but with an additional link field. And there is a separate one for the weird pdf field.
Notice that each of those validators is configured with pre=True, because the data coming in will not be of the declared field type(s), so our custom validators need to do their thing before the default Pydantic field validators kick in.
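The effect of pre=True can be seen in a tiny standalone model (hypothetical field name, same unwrapping trick as above) that unpacks {"value": ...} dictionaries before the regular type validation runs:

```python
from pydantic import BaseModel, validator


class Wrapped(BaseModel):
    amount: int

    @validator("amount", pre=True)
    def unwrap_value(cls, value: object) -> object:
        # Runs before the int validation, so it can unpack the raw dict.
        if isinstance(value, dict):
            return value["value"]
        return value


print(Wrapped(amount={"value": 5}).amount)  # 5
```

Without pre=True, the dictionary would hit the int validator first and fail before our function ever ran.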
With this setup, if we put your example data into a dictionary called EXAMPLE, we can test our parsers like this:
...
if __name__ == "__main__":
    instance = RawEntity.parse_obj(EXAMPLE)
    parsed = FlatEntity.from_raw_entity(instance)
    print(parsed.json(indent=4))
Here is the output:
{
    "type": "entity",
    "id": "efebcc3e-445c-4d85-9689-bb85f46160cb",
    "eid": "efebcc3e-445c-4d85-9689-bb85f46160cb",
    "created_at": "2021-07-14T05:58:47.239000+00:00",
    "edited_at": "2022-09-22T11:28:53.327000+00:00",
    "state": "open",
    "name": "E03075-042",
    "description": "",
    "link": "https://example.com/api/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb",
    "department": "Foo",
    "division": "Bar",
    "project": "My Project",
    "created_by": {
        "type": "user",
        "id": "101",
        "link": "https://example.com/api/rest/v1.0/users/101"
    },
    "edited_by": {
        "type": "user",
        "id": "101",
        "link": "https://example.com/api/rest/v1.0/users/101"
    },
    "ancestors": [
        {
            "type": "entity",
            "id": "7h60bcb9-b1c0-4a12-8b6b-12e3eab54e6f",
            "link": "https://example.com/api/rest/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb/ancestors"
        }
    ],
    "owner": {
        "type": "user",
        "id": "101",
        "link": "https://example.com/api/rest/v1.0/users/101"
    },
    "pdf": "https://example.com/api/rest/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb/pdf"
}
I'm sure you can further adjust/optimize this to your needs, but the general approach should be clear. By default, Pydantic models ignore unknown keys/fields when parsing data, so this should not be a problem.
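You can verify that default behavior with a throwaway model:

```python
from pydantic import BaseModel


class Minimal(BaseModel):
    id: str


# "unexpected" does not match any declared field and is silently dropped.
obj = Minimal.parse_obj({"id": "1", "unexpected": "junk"})
print(obj.dict())  # {'id': '1'}
```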
As for your second question, I think this warrants a separate post on this site, but in general I would not perform web requests during validation.
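To sketch what I mean: do the resolution as a separate step after parsing, with the HTTP call injected as a function so the logic stays testable. Note that the endpoint path and the response shape below are guesses about your API and would need adjusting:

```python
from typing import Any, Callable


# Hypothetical response shape: {"data": {"attributes": {"name": ...}}}
def resolve_user_name(user_id: str, fetch: Callable[[str], dict[str, Any]]) -> str:
    # The real `fetch` would perform an HTTP GET and return the parsed JSON body.
    payload = fetch(f"https://example.com/api/rest/v1.0/users/{user_id}")
    return payload["data"]["attributes"]["name"]


# Stub standing in for the real HTTP call, e.g. in tests:
def fake_fetch(url: str) -> dict[str, Any]:
    return {"data": {"attributes": {"name": "Jane Doe"}}}


print(resolve_user_name("101", fake_fetch))  # Jane Doe
```

This keeps network I/O out of the models entirely: parse first, then resolve the user ids you collected in a second pass.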