Avoid huge memory usage in Python when substituting elements in large list of strings


So I'm trying to parse a large list of strings with comma-separated values into (a list of) lists. I'm trying to do this "in place", so as not to duplicate an already large object in memory.

Now, ideally, during and after parsing, the only additional memory required would be the overhead of representing the original strings as lists of strings. But what actually happens is much, much worse.

E.g. this list of strings occupies ~1.36 GB of memory:

import psutil

l = [f"[23873498uh3149ubn34, 59ubn23459un3459, un3459-un345, 9u3n45iu9n345, {i}]" for i in range(10_000_000)]
psutil.Process().memory_info().rss / 1024**3
1.3626747131347656

The desired end result of the parsing would take up somewhat more (~1.8 GB):

import psutil

l = [["23873498uh3149ubn34", "59ubn23459un3459", "un3459-un345", "9u3n45iu9n345", str(i)] for i in range(10_000_000)]
psutil.Process().memory_info().rss / 1024**3
1.7964096069335938

However, actually parsing the original strings requires a whopping 5 GB of memory, i.e. much more than even the initial strings and the final lists combined:

import psutil

l = [f"[23873498uh3149ubn34, 59ubn23459un3459, un3459-un345, 9u3n45iu9n345, {i}]" for i in range(10_000_000)]

for i, val in enumerate(l):
    l[i] = val.split(", ")
    
psutil.Process().memory_info().rss / 1024**3
4.988628387451172

Now, I understand that pure Python strings and lists are not terribly memory-efficient, but I fail to understand the huge gap between the memory required for the final result (~1.8 GB) and what's used in the process of getting there (~5 GB).

Can anyone explain what exactly is going on, whether it is possible to modify lists "in place" (while actually freeing the memory of the replaced values), and whether there is a better in-memory way to do this?

CodePudding user response:

Your estimate of the memory it "should take" is wildly optimistic. Reformatting that code so it's readable:

l = [["23873498uh3149ubn34",
      "59ubn23459un3459",
      "un3459-un345",
      "9u3n45iu9n345",
      str(i)] for i in range(10_000_000)]

It's way off base: for example, only one string object is ever created for "9u3n45iu9n345", and it is shared across all 10 million lists.

>>> print(l[10][2] is l[56000][2]) # the outer indices are arbitrary
True

Your actual parsing code, by contrast, creates 10 million distinct string objects for each field position.
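
A quick way to see the difference (a small sketch; object identity of string literals is a CPython caching/interning detail, not a language guarantee):

# In the "desired result" comprehension, the literal is created once per code
# object and reused on every iteration:
rows = [["9u3n45iu9n345", str(i)] for i in range(3)]
print(rows[0][0] is rows[2][0])    # True - one shared string object

# split() builds brand-new string objects for every row:
lines = ["9u3n45iu9n345, %d" % i for i in range(3)]
parts = [s.split(", ") for s in lines]
print(parts[0][0] == parts[2][0])  # True  - equal text
print(parts[0][0] is parts[2][0])  # False - distinct objects, each paying its own overhead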

Try running

sys._debugmallocstats()

from time to time to get more clues about what the memory is being used for.
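
For example (a sketch; _debugmallocstats is a private, CPython-specific helper that writes allocator statistics to stderr):

import sys

l = [f"[23873498uh3149ubn34, 59ubn23459un3459, un3459-un345, 9u3n45iu9n345, {i}]"
     for i in range(10_000_000)]
sys._debugmallocstats()    # allocator state before parsing

for i, val in enumerate(l):
    l[i] = val.split(", ")

sys._debugmallocstats()    # allocator state after parsing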

Here, under Python 3.10.1 on 64-bit Windows, the summary output at the end shows that the allocated RAM is being used very efficiently (nearly all allocated bytes are in blocks that are actually in use):

# arenas allocated total           =                6,601
# arenas reclaimed                 =                1,688
# arenas highwater mark            =                4,913
# arenas allocated current         =                4,913
4913 arenas * 1048576 bytes/arena  =        5,151,653,888

# bytes in allocated blocks        =        5,129,947,088
# bytes in available blocks        =              701,584
53 unused pools * 16384 bytes      =              868,352
# bytes lost to pool headers       =           15,090,192
# bytes lost to quantization       =            5,046,672
# bytes lost to arena alignment    =                    0
Total                              =        5,151,653,888

arena map counts
# arena map mid nodes              =                    1
# arena map bot nodes              =                   20

# bytes lost to arena map root     =                8,192
# bytes lost to arena map mid      =                8,192
# bytes lost to arena map bot      =               40,960
Total                              =               57,344

Note: an easy way to save about 800 million bytes:

    l[i] = tuple(val.split(", ")) # use this
    # l[i] = val.split(", ")      # instead of this

Lists "overallocate" so that a sequence of appends runs in amortized O(1) time. Tuples don't - they allocate only as much as needed to hold their initial content (which is also their final content, since tuples are immutable).

CodePudding user response:

If you are not consuming the entire list at once, you can use generator expressions instead:

import psutil

data = (f"[23873498uh3149ubn34, 59ubn23459un3459, un3459-un345, 9u3n45iu9n345, {i}]" for i in range(10_000_000))
res = (x.split(', ') for x in data)
    
print(psutil.Process().memory_info().rss / 1024**3) # 0.08874893188476562
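
Note that nothing has actually been split at this point; the rows are produced only as you iterate. Memory stays low as long as each row is processed and discarded as it streams by, e.g. (continuing the snippet above):

total = 0
for row in res:        # rows are produced lazily, one at a time
    total += len(row)  # example per-row work; only one row is alive at once
print(total)           # 50_000_000 (5 fields x 10,000,000 rows)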

CodePudding user response:

Measuring RSS is the wrong metric. RSS is the part of virtual memory that currently resides in physical RAM, so if I have only 4 GB of physical RAM, I'll never measure more than 4 GB. (Taken to the extreme: if you want to keep that number small, just remove RAM modules from your computer.)
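
If you want a figure that is not capped by the amount of physical RAM, look at the virtual size as well; psutil reports both on the same memory_info() result:

import psutil

info = psutil.Process().memory_info()
print(info.rss / 1024**3)   # resident set size: pages currently held in physical RAM
print(info.vms / 1024**3)   # virtual memory size: everything the process has mapped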

The data in the strings is about 650 MB.

A list object needs 56 bytes of memory on its own. You are creating 10,000,000 of them, so the list objects account for about 560 MB of additional memory.

A string takes 49 bytes of memory on top of its characters. After splitting the long strings into pieces, you have 50,000,000 strings, so their overhead accounts for another ~2450 MB.
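
Putting those per-object overheads together gives a rough back-of-the-envelope estimate (64-bit CPython; the constants come from the sys.getsizeof calls shown at the end and differ between versions):

n_rows = 10_000_000
fields_per_row = 5

list_overhead = 56   # sys.getsizeof([])
str_overhead = 49    # sys.getsizeof("")

print(n_rows * list_overhead / 1e6)                  # ~560 MB for the list objects
print(n_rows * fields_per_row * str_overhead / 1e6)  # ~2450 MB of per-string overhead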

Another factor is garbage collection and Python's internal memory management. With psutil, you're asking the operating system how much memory it has handed to the Python process. Python may simply have requested that much because it saw you were dealing with a lot of data, while some of it is still considered free internally and available for reuse.
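
One way to check whether this is a factor is to force a collection pass and measure again (a sketch; even after this, CPython's allocator may keep empty arenas around for reuse instead of returning them to the operating system immediately):

import gc
import psutil

gc.collect()  # force a full garbage-collection pass
print(psutil.Process().memory_info().rss / 1024**3)  # re-measure afterwards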

Other than that, I'd recommend doing real-world tests. A good CSV parser will not load the entire file into memory and then split the strings; instead, it will stream the data and split at each comma as it encounters one.
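
With the standard library that could look like this (a minimal sketch; data.csv is a hypothetical file containing the rows):

import csv

with open("data.csv", newline="") as f:
    for row in csv.reader(f):  # rows are read and split one at a time
        ...                    # process the row; it can be freed before the next one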

To get the per-object overhead of these types, I used:

import sys
print(sys.getsizeof([]))
print(sys.getsizeof(""))