How to serialize a large queryset and write it to JSON?


Django 3.2.10, Python 3.9

I have a QuerySet of 100,000 users. I need to serialize them and write them to a JSON file.

import json

queryset = Users.objects.all()  # 100 000 users

with open('output.json', 'w') as f:
    serializer = MySerializer(queryset, many=True)
    # serializer.data builds the entire payload in memory before writing
    json.dump(serializer.data, f)

I tried the .iterator() method, but I didn't understand how it should work. I also tried the following, but it is too slow.

import json

queryset = Users.objects.all()  # 100 000 users

with open('output.json', 'w') as f:
    for user in queryset:
        serializer = MySerializer(user)
        # note: one dump() call per user writes concatenated JSON objects,
        # not a single valid JSON array
        json.dump(serializer.data, f)
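
Regarding the .iterator() attempt: it streams rows from the database cursor in chunks instead of caching the whole result set on the queryset, so it helps with memory use rather than speed. A minimal sketch, reusing MySerializer from above:

# .iterator(chunk_size=...) fetches rows from the DB in batches and does
# not cache them on the queryset, so memory stays flat while iterating
for user in Users.objects.all().iterator(chunk_size=2000):
    data = MySerializer(user).data  # still one serializer call per user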

CodePudding user response:

I found this answer which I think will help us: https://stackoverflow.com/a/45143995/202168

Taking the code from that answer and applying it to your problem gives something like:

class StreamArray(list):
    """
    Wraps a generator in a list subclass that the json encoder will
    accept, while still retaining the iterative nature of the generator.

    i.e. it behaves like a list without having to exhaust the generator
    and keep its contents in memory.
    """
    def __init__(self, generator):
        self.generator = generator
        self._len = 1  # non-zero so the encoder doesn't short-circuit to '[]'

    def __iter__(self):
        self._len = 0
        for item in self.generator:
            yield item
            self._len += 1

    def __len__(self):
        """
        The JSON encoder calls this to check whether the object is empty
        before iterating it.
        """
        return self._len
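
To see the trick in isolation, here's a tiny standalone check with a plain generator (nothing Django-specific):

import json

squares = (i * i for i in range(5))
print(''.join(json.JSONEncoder().iterencode(StreamArray(squares))))
# prints: [0, 1, 4, 9, 16]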


def serialized_users(queryset):
    # yield one serialized user at a time instead of building a full list
    for user in queryset:
        yield MySerializer(user).data


import json

queryset = Users.objects.all()  # 100 000 users

stream_array = StreamArray(serialized_users(queryset))

with open('output.json', 'w') as f:
    # iterencode yields the JSON output piece by piece instead of
    # building the whole document as one string in memory
    for chunk in json.JSONEncoder().iterencode(stream_array):
        f.write(chunk)

Assuming the code from the other question works as intended, this should let us iterate over your queryset, serialize each user one by one, and encode the result to a JSON list incrementally.

It may still be slow, but it should avoid running out of memory or similar problems.
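
As a side note, json.dump itself writes the chunks produced by iterencode one at a time (at least in CPython), so the explicit loop can be collapsed to a single call:

with open('output.json', 'w') as f:
    # dump iterates iterencode() internally and writes each chunk,
    # so this streams the same way as the explicit loop above
    json.dump(stream_array, f)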

You may still have a problem with the query timing out in your db before you finish iterating, in which case we have to adapt our solution to use Django's Paginator, so that instead of a single query returning the whole queryset we issue a series of queries, each returning the next chunk of results.

We can change the serialized_users function above to use pagination like this:

from django.core.paginator import Paginator

def serialized_users(queryset):
    # page through the queryset 1000 rows at a time so each DB query
    # stays small; the queryset should be ordered (see note below) or
    # Django emits an UnorderedObjectListWarning
    paginator = Paginator(queryset, 1000)
    for page_number in paginator.page_range:
        page = paginator.page(page_number)
        for user in page.object_list:
            yield MySerializer(user).data

(rest of code as before)
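
One caveat: Paginator needs a deterministic order to produce stable page boundaries, and Django emits an UnorderedObjectListWarning if the queryset is unordered, so it's worth ordering explicitly when wiring things up:

queryset = Users.objects.order_by('pk')  # stable ordering for pagination

stream_array = StreamArray(serialized_users(queryset))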

CodePudding user response:

I think you don't need to loop over the queryset. Just pass the many=True argument when instantiating the serializer.

import json

with open('output.json', 'w') as f:
    serializer = MySerializer(queryset, many=True)
    json.dump(serializer.data, f)