Home > Software design >  Parsing empty strings to 0 for numeric fields when parsing with Pydantic
Parsing empty strings to 0 for numeric fields when parsing with Pydantic

Time:11-07

I have recently had to use Pydantic for parsing JSON documents, and given the nature of the project (which involves ingesting some old documents from crummy scans) it turns out that we will be implementing the module which generates the JSON by analyzing the scans, and we are also the ones who are supposed to prepare the Pydantic schema for validating and parsing the same JSON files.

Now, it so happens that in many cases, there are numeric fields in the documents which have been left blank. Since these fields are numeric, the schema must treat the fields as int. If the module that scans the document and prepares a JSON output doesn't find a particular field, of course, Pydantic will simply generate a default value (of zero) for it using pydantic.Field with a default argument. But the trouble occurs when the field is found but left blank. This is because the parse_raw method will attempt to parse the field, find an empty string "" in it, and raise a ValidationError.

Of course, an easy solution is for the analysis module to make sure that all numeric fields are mapped to 0 if they are empty. But this will require the analysis module to be aware of the fields in the input, to know which of them are numeric, and to map them to "0" from "".

While this is not inherently a problem, I would prefer that this task be automated by pydantic. For one thing, if we are already generating a Pydantic schema with information about the nature of the fields in the second module of the pipeline, injecting datatype information into the first module as well becomes redundant. For another, the first module is already a heavy CV unit with a massive amount of code, so adding more features into it and bloating it further is just not what we want to do.

I mean, if there is an automated parser with a schema, it makes sense that this parser should be able to do some elementary mapping. It would be nice if the parser were able to map all instances of empty string "" to instances of zero "0" for us without us having to worry. This is the functionality we are looking for.

Consider the JSON file:

{
  "a": ""
}

Now consider the class:

class A(BaseModel):
  a: int = ...

If I call A.parse_file and give the method this file I have described above, is there anything I can put inside the region marked by the ellipsis in the Python code so that instead of raising an exception, the method returns an object with {'a': 0} as its __dict__ dunder?

I have looked through pydantic.Field, but I could not find anything.

CodePudding user response:

It was a little hard to understand, what your actual setup is, but if I understood correctly, you define your own models. In that case, you just need to write a custom validator with the pre=True setting to turn any empty string into a numeric zero for fields with numeric types.

If you want a validator that is more or less universal, you can even set it to each_item=True. Then it will work with fields representing collections of numbers, such as list[float] as well.

Here is a full working example:

from numbers import Number
from typing import Any

from pydantic import BaseModel, validator
from pydantic.fields import ModelField

class Foo(BaseModel):
    a: int
    b: float
    c: complex
    d: list[float]
    e: str

    @validator("*", pre=True, each_item=True)
    def empty_to_zero(cls, v: Any, field: ModelField) -> Any:
        if issubclass(field.type_, Number) and v == "":
            return field.type_(0)
        return v

    class Config:
        arbitrary_types_allowed = True  # only needed for `complex`

if __name__ == "__main__":
    data = {
        "a": "",
        "b": "",
        "c": "",
        "d": ["3.14", ""],
        "e": "Hi mom",
    }
    foo = Foo.parse_obj(data)
    print(foo)
    print(foo.dict())

Output:

a=0 b=0.0 c=0j d=[3.14, 0.0] e='Hi mom'
{'a': 0, 'b': 0.0, 'c': 0j, 'd': [3.14, 0.0], 'e': 'Hi mom'}

This rests on the assumption that any reasonable Number subclass will initialize properly with a 0 passed to the constructor, which I don't think is a big ask.

  • Related