Memory usage of a list of millions of strings in Python


As seen in Find the memory size of a set of strings vs. set of bytestrings, it's difficult to precisely measure the memory used by a set or list containing strings. But here is a good estimation/upper bound:

import os, psutil
process = psutil.Process(os.getpid())
a = process.memory_info().rss
L = [b"a%09i" % i for i in range(10_000_000)]
b = process.memory_info().rss
print(L[:10])  # [b'a000000000', b'a000000001', b'a000000002', b'a000000003', b'a000000004', b'a000000005', b'a000000006', b'a000000007', b'a000000008', b'a000000009']
print(b-a)
# 568762368 bytes

i.e. 569 MB for 100 MB of actual data.

Solutions to improve this (for example with other data structures) have been found in Memory-efficient data structure for a set of short bytes-strings and Set of 10-char strings in Python is 10 times bigger in RAM as expected. So my question here is not "how to improve", but:

How can we precisely explain this size in the case of a standard list of byte-strings?

How many bytes are used for each byte-string and for each (linked?) list item, to finally arrive at 569 MB?

This will help in understanding the internals of lists and byte-strings in CPython (platform: Windows 64-bit).

CodePudding user response:

Summary:

  • 89 MB for the list object
  • 480 MB for the string objects
  • => total 569 MB
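The totals above can be checked with back-of-the-envelope arithmetic. This sketch assumes the figures explained below: 8-byte pointers, roughly 12.5% list overallocation, and 48 bytes of allocator space per bytestring.

```python
# Rough arithmetic for the summary (assumed sizes: 8-byte pointers,
# ~12.5% list overallocation, 48 bytes per bytestring allocation).
n = 10_000_000
list_bytes = 56 + int(n * 8 * 1.125)  # list header + overallocated pointer array
str_bytes = n * 48                    # each 43-byte object padded to 48 bytes
total = list_bytes + str_bytes
print(list_bytes, str_bytes, total)   # ~90 MB + 480 MB = ~570 MB
```

The result lands within a couple of megabytes of the measured 568,762,368 bytes; the remainder is allocator bookkeeping and whatever else the process touched between the two RSS readings.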

sys.getsizeof(L) will tell you the list object itself is about 89 MB. That's a few dozen organizational bytes, 8 bytes per bytestring reference, and up to 12.5% overallocation to allow efficient insertions.
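You can verify the per-reference cost on a smaller list, assuming 64-bit CPython (where each reference is an 8-byte pointer and lists built by appending overallocate by up to 12.5%):

```python
import sys

# Sketch: measure bytes-per-reference in a list built by comprehension
# (64-bit CPython assumed; comprehensions append, so the pointer array
# is overallocated by up to ~12.5%).
L = [b"a%09i" % i for i in range(1_000_000)]
per_ref = (sys.getsizeof(L) - sys.getsizeof([])) / len(L)
print(per_ref)  # between 8 and 9 bytes per reference
```

Scaled up to 10 million entries, 8–9 bytes per reference plus the fixed header gives the ~89 MB figure.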

sys.getsizeof(one_of_your_bytestrings) will tell you they're 43 bytes each. That's:

  • 8 bytes for the reference counter
  • 8 bytes for the pointer to the type
  • 8 bytes for the length (since bytestrings aren't fixed size)
  • 8 bytes hash
  • 10 bytes for your actual bytestring content
  • 1 byte for a terminating 0-byte.
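The breakdown can be confirmed directly: on 64-bit CPython, sys.getsizeof reports 33 bytes for an empty bytestring (the header fields listed above plus the terminating NUL), and one more byte per character of content:

```python
import sys

# Sketch verifying the 43-byte figure on 64-bit CPython:
# 33 bytes of fixed overhead (refcount, type pointer, length,
# cached hash, trailing NUL) + 10 bytes of content = 43.
s = b"a%09i" % 0              # a 10-byte bytestring: b'a000000000'
print(sys.getsizeof(b""))     # 33
print(sys.getsizeof(s))       # 43
```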

Storing the objects at 43-byte intervals would make them cross memory word boundaries, which is slower, so in practice they're usually stored every 48 bytes. You can use id(one_of_your_bytestrings) to get the addresses and check.
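A quick way to inspect the spacing, assuming CPython (where id() returning the memory address is an implementation detail, and the exact gaps depend on the allocator's state):

```python
# Sketch: look at the address gaps between consecutively allocated
# bytestrings (CPython only; id() as an address is an implementation
# detail, and gaps vary with allocator state).
L = [b"a%09i" % i for i in range(10)]
gaps = [id(b) - id(a) for a, b in zip(L, L[1:])]
print(gaps)  # gaps are often 48 bytes on a fresh 64-bit process, but vary
```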

(There's some variance here and there, partly due to the exact memory allocations that happen, but 569 MB is about what's expected knowing the above reasons, and it matches what you measured.)
