Parsing and flattening complex JSON with Pydantic


I need to consume JSON from a third-party API, i.e. I have to deal with whatever this API returns and can't change it.

For this specific task the API returns what it calls an "entity". Yeah, not very meaningful. The issue is that the structure is deeply nested, and when parsing I want to be able to flatten it to some degree. To explain, here is an obfuscated example of a single "entity". In the full response this sits in an array named "data" which can contain multiple entities.

{
  "type": "entity",
  "id": "efebcc3e-445c-4d85-9689-bb85f46160cb",
  "links": {
    "self": "https://example.com/api/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb"
  },
  "attributes": {
    "id": "efebcc3e-445c-4d85-9689-bb85f46160cb",
    "eid": "efebcc3e-445c-4d85-9689-bb85f46160cb",
    "name": "E03075-042",
    "description": "",
    "createdAt": "2021-07-14T05:58:47.239Z",
    "editedAt": "2022-09-22T11:28:53.327Z",
    "state": "open",
    "fields": {
      "Department": {
        "value": "Foo"
      },
      "Description": {
        "value": ""
      },
      "Division": {
        "value": "Bar"
      },
      "Name": {
        "value": "E03075-042"
      },
      "Project": {
        "details": {
          "description": ""
        },
        "value": "My Project"
      }
    }
  },
  "relationships": {
    "createdBy": {
      "links": {
        "self": "https://example.com/api/rest/v1.0/users/101"
      },
      "data": {
        "type": "user",
        "id": "101"
      }
    },
    "editedBy": {
      "links": {
        "self": "https://example.com/api/rest/v1.0/users/101"
      },
      "data": {
        "type": "user",
        "id": "101"
      }
    },
    "ancestors": {
      "links": {
        "self": "https://example.com/api/rest/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb/ancestors"
      },
      "data": [
        {
          "type": "entity",
          "id": "7h60bcb9-b1c0-4a12-8b6b-12e3eab54e6f",
          "meta": {
            "links": {
              "self": "https://example.com/api/rest/v1.0/entities/7h60bcb9-b1c0-4a12-8b6b-12e3eab54e6f"
            }
          }
        }
      ]
    },
    "owner": {
      "links": {
        "self": "https://example.com/api/rest/v1.0/users/101"
      },
      "data": {
        "type": "user",
        "id": "101"
      }
    },
    "pdf": {
      "links": {
        "self": "https://example.com/api/rest/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb/pdf"
      }
    }
  }
}

I want to parse this into a data container. I'm open to custom parsing and a plain data class instead of Pydantic if what I want isn't possible otherwise.

Issues with the data:

  • links: uses self as a field name in the JSON. I would like to unnest this and have a top-level field named simply link
  • attributes: unnest as well, rather than keeping them inside an Attributes model
  • fields: unnest to the top level and remove/ignore duplicates (name, description)
  • Project in fields: unnest to the top level and keep only the value field
  • relationships: unnest, ignore some, and maybe even resolve to actual user names

Can I control Pydantic in such a way to unnest the data as I prefer and ignore unmapped fields?

Can the parsing also include resolving, which means more API calls?

CodePudding user response:

Pydantic provides root validators to perform validation on the entire model's data. In this case, though, I am not sure it is a good idea to do everything in one giant validation function.

I would probably go with a two-stage parsing setup. The first model should capture the "raw" data in more or less the schema you expect from the API. The second model should then reflect your own desired data schema.

That way, if you encounter an error, you can easily pinpoint whether it came from an unexpected data format returned by the API or from a bug in your own flattening/parsing logic further down the line.

Following is an example.

Start off by defining base classes for inheritance/less duplication later on (note that this uses the Pydantic v1 API):

from __future__ import annotations
from datetime import datetime
from enum import Enum
from typing import Any

from pydantic import AnyHttpUrl, BaseModel, Field, root_validator, validator
from pydantic.fields import ModelField, SHAPE_LIST

class StateEnum(Enum):
    open = "open"
    something_else = "something_else"

class BaseAttributes(BaseModel):
    id: str
    eid: str
    created_at: datetime = Field(alias="createdAt")
    edited_at: datetime = Field(alias="editedAt")
    state: StateEnum
    # some fields:
    name: str = Field(alias="Name")
    description: str = Field(alias="Description")

    class Config:
        allow_population_by_field_name = True

class RawRelationship(BaseModel):
    links: dict[str, AnyHttpUrl]
    data: dict[str, Any] | list[dict[str, Any]] | None = None

class BaseEntity(BaseModel):
    type: str
    id: str

# More code below...

The state field just screamed "choices" at me, so I went with an enum, just as an idea. I also chose to use pythonic field names alongside the actual data key names as aliases.

Now you can define your RawEntity model to capture the raw API output:

...

class RawAttributes(BaseAttributes):
    fields: dict[str, Any]

class RawEntity(BaseEntity):
    links: dict[str, AnyHttpUrl]
    attributes: RawAttributes
    relationships: dict[str, RawRelationship] = {}

    @root_validator
    def ensure_consistency(cls, values: dict[str, Any]) -> dict[str, Any]:
        if values["id"] != values["attributes"].id:
            raise ValueError("id inconsistent")
        return values

# More code below...

There you have a demo of how root validation can make sense.

Finally, we can write our target model. We can give it a class method specifically for parsing a RawEntity into a FlatEntity, which performs a few of the flattening tasks. The field-specific ones we can delegate to validators again:

...

SELF_KEY = "self"

class FlatRelationship(BaseEntity):
    link: AnyHttpUrl

class FlatEntity(BaseAttributes, BaseEntity):
    link: AnyHttpUrl
    # more fields:
    department: str = Field(alias="Department")
    division: str = Field(alias="Division")
    project: str = Field(alias="Project")
    # relationships:
    created_by: FlatRelationship = Field(alias="createdBy")
    edited_by: FlatRelationship = Field(alias="editedBy")
    ancestors: list[FlatRelationship]
    owner: FlatRelationship
    pdf: AnyHttpUrl

    @classmethod
    def from_raw_entity(cls, entity: RawEntity) -> FlatEntity:
        data = entity.dict(exclude={"links", "attributes", "relationships"})
        data["link"] = entity.links[SELF_KEY]
        data |= entity.attributes.dict(exclude={"fields"})
        data |= entity.attributes.fields
        data |= entity.relationships
        return cls.parse_obj(data)

    @validator(
        "name",
        "description",
        "department",
        "division",
        "project",
        pre=True,
    )
    def get_field_value(cls, value: object) -> object:
        if isinstance(value, dict):
            return value["value"]
        return value

    @validator("*", pre=True)
    def flatten_relationship(cls, value: object, field: ModelField) -> object:
        if field.type_ is not FlatRelationship:
            return value
        if not isinstance(value, RawRelationship):
            return value
        if isinstance(value.data, dict):
            return FlatRelationship(**value.data, link=value.links[SELF_KEY])
        if isinstance(value.data, list) and field.shape == SHAPE_LIST:
            return [
                FlatRelationship(**data, link=value.links[SELF_KEY])
                for data in value.data
            ]
        return value

    @validator("pdf", pre=True)
    def get_pdf_link(cls, value: object) -> object:
        if isinstance(value, RawRelationship):
            return value.links[SELF_KEY]
        return value

# More code below...

As you can see, there is one validator covering all the fields that are grouped under "fields" in the source data. Another one flattens the "relationships", which basically turns them into BaseEntity instances with an added link field. And there is a separate one for the odd pdf field.

Notice that each of those validators is configured with pre=True because the incoming data will not match the declared field type(s), so our custom validators need to do their thing before the default Pydantic field validators kick in.

With this setup, if we put your example data into a dictionary called EXAMPLE, we can test our parsers like this:

...

if __name__ == "__main__":
    instance = RawEntity.parse_obj(EXAMPLE)
    parsed = FlatEntity.from_raw_entity(instance)
    print(parsed.json(indent=4))

Here is the output:

{
    "type": "entity",
    "id": "efebcc3e-445c-4d85-9689-bb85f46160cb",
    "eid": "efebcc3e-445c-4d85-9689-bb85f46160cb",
    "created_at": "2021-07-14T05:58:47.239000 00:00",
    "edited_at": "2022-09-22T11:28:53.327000 00:00",
    "state": "open",
    "name": "E03075-042",
    "description": "",
    "link": "https://example.com/api/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb",
    "department": "Foo",
    "division": "Bar",
    "project": "My Project",
    "created_by": {
        "type": "user",
        "id": "101",
        "link": "https://example.com/api/rest/v1.0/users/101"
    },
    "edited_by": {
        "type": "user",
        "id": "101",
        "link": "https://example.com/api/rest/v1.0/users/101"
    },
    "ancestors": [
        {
            "type": "entity",
            "id": "7h60bcb9-b1c0-4a12-8b6b-12e3eab54e6f",
            "link": "https://example.com/api/rest/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb/ancestors"
        }
    ],
    "owner": {
        "type": "user",
        "id": "101",
        "link": "https://example.com/api/rest/v1.0/users/101"
    },
    "pdf": "https://example.com/api/rest/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb/pdf"
}

I'm sure you can further adjust/optimize this to your needs, but the general approach should be clear. By default, Pydantic models ignore unknown keys/fields when parsing data, so that should not be a problem.
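A quick sketch of that default behavior, using a throwaway model with made-up values:

```python
from pydantic import BaseModel


class User(BaseModel):
    id: str
    name: str


# "role" is not declared on the model, so it is silently dropped by default.
user = User.parse_obj({"id": "101", "name": "Alice", "role": "admin"})
assert user.dict() == {"id": "101", "name": "Alice"}
```

If you prefer a hard failure on unexpected keys instead, you can set extra = "forbid" in the model's Config.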

As for your second question, I think it warrants a separate post on this site, but in general I would not perform web requests during validation.
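Instead, I would resolve names in a separate step after parsing. A rough sketch of what that could look like: the "data" -> "attributes" -> "name" path is purely an assumption about the user endpoint's response shape (adjust it to the real API), and the fetcher is injected so that validation and tests stay free of network I/O:

```python
import json
from typing import Callable
from urllib.request import urlopen


def fetch_json(url: str) -> dict:
    """Plain HTTP GET returning the decoded JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def resolve_user_name(link: str, fetch: Callable[[str], dict] = fetch_json) -> str:
    """Resolve a relationship's self link to a user name.

    NOTE: the payload["data"]["attributes"]["name"] path is a guess at the
    user endpoint's schema, not something confirmed by the question.
    """
    payload = fetch(link)
    return payload["data"]["attributes"]["name"]


# Usage with a stub in place of a real request:
fake_api = {
    "https://example.com/api/rest/v1.0/users/101": {
        "data": {"attributes": {"name": "Jane Doe"}}
    }
}
name = resolve_user_name(
    "https://example.com/api/rest/v1.0/users/101", fake_api.__getitem__
)
assert name == "Jane Doe"
```

You could then call something like this once per parsed FlatEntity, after validation has succeeded, rather than inside a validator.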
