Django 3.2.10, Python 3.9
I have a QuerySet of 100 000 users. I need to serialize them and write them to a JSON file.
import json

queryset = Users.objects.all()  # 100 000 users
with open('output.json', 'w') as f:
    serializer = MySerializer(queryset, many=True)
    json.dump(serializer.data, f)
I tried the .iterator() method, but I didn't understand how it should work. I also tried the following, but it is too slow.
import json

queryset = Users.objects.all()  # 100 000 users
with open('output.json', 'w') as f:
    for user in queryset:
        serializer = MySerializer(user)
        json.dump(serializer.data, f)
CodePudding user response:
I found this answer which I think will help us: https://stackoverflow.com/a/45143995/202168
Taking the code from that answer and applying it to your problem gives something like:
import json

class StreamArray(list):
    """
    Wraps a generator in a list-like object that json can serialise,
    while retaining the lazy nature of the generator, i.e. without
    having to exhaust the generator and keep its contents in memory.
    """
    def __init__(self, generator):
        self.generator = generator
        self._len = 1

    def __iter__(self):
        self._len = 0
        for item in self.generator:
            yield item
            self._len = 1

    def __len__(self):
        """
        The JSON encoder calls this to decide whether the "list" is
        empty, so it must return a non-zero value for encoding to proceed.
        """
        return self._len
def serialized_users(queryset):
    for user in queryset:
        yield MySerializer(user).data
queryset = Users.objects.all()  # 100 000 users
stream_array = StreamArray(serialized_users(queryset))

with open('output.json', 'w') as f:
    for chunk in json.JSONEncoder().iterencode(stream_array):
        f.write(chunk)
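As a quick sanity check that the trick itself works (independent of Django or your serializer), here is a self-contained version fed from a plain generator — the generator and its values are made up purely for illustration:

```python
import json

class StreamArray(list):
    """List subclass that lets json encode a generator lazily."""
    def __init__(self, generator):
        self.generator = generator
        self._len = 1  # non-zero so the encoder doesn't short-circuit to '[]'

    def __iter__(self):
        self._len = 0
        for item in self.generator:
            yield item
            self._len = 1

    def __len__(self):
        return self._len

def numbers():
    yield from (1, 2, 3)

# iterencode consumes the generator one item at a time
encoded = "".join(json.JSONEncoder().iterencode(StreamArray(numbers())))
print(encoded)  # → [1, 2, 3]
```

Note that this relies on calling `iterencode` directly on a `JSONEncoder` instance — `json.dumps` takes a faster one-shot path through the C encoder, which bypasses the overridden `__iter__`.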
Assuming the code from the other question works as intended, this should allow us to iterate over your queryset, serialize each user one by one, and encode the results to a JSON list incrementally.
It may still be slow, but it should avoid running out of memory or similar problems.
You may still have a problem with the query timing out in your database before you finish iterating. In that case we have to adapt the solution to use Django's Paginator, so that instead of a single query returning the whole queryset, we iterate through a series of queries, each returning the next chunk of results.
We can change the serialized_users function above to use pagination like:
from django.core.paginator import Paginator

def serialized_users(queryset):
    paginator = Paginator(queryset, 1000)
    for page_number in paginator.page_range:
        page = paginator.page(page_number)
        for user in page.object_list:
            yield MySerializer(user).data
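If it helps to see the chunking idea on its own, here is a minimal Django-free sketch of what the Paginator is doing for us — slicing a sequence into fixed-size pages. The function name and page size here are made up for illustration:

```python
def serialized_chunks(items, per_page):
    """Yield successive fixed-size pages, mimicking Paginator's slicing."""
    for start in range(0, len(items), per_page):
        yield items[start:start + per_page]

pages = list(serialized_chunks(list(range(10)), 3))
print(pages)  # → [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

With a real queryset, each page corresponds to a separate `LIMIT`/`OFFSET` query, which is what keeps any single query from running long enough to time out.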
(rest of code as before)
CodePudding user response:
I think you don't need to loop over the queryset. Just pass many=True to MySerializer:
import json

with open('output.json', 'w') as f:
    serializer = MySerializer(queryset, many=True)
    json.dump(serializer.data, f)