I have a hypothetical question regarding the memory usage of lists in Python. I have a long list my_list that consumes multiple gigabytes once it is loaded into memory. I want to loop over that list and use each element exactly once during the iteration, which means I could delete each element from the list after processing it. While I am looping, I am storing something else in memory, so the memory allocated for my_list is needed for something else. Ideally, then, I would like to delete the list elements and free their memory while I am looping over them.
I assume that, in most cases, a generator would make the most sense here. I could dump my list to a csv file and then read it back line by line in a for loop. In that case, my_list would never be loaded into memory in the first place. However, let's assume for the sake of discussion that I don't want to do that.
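(For concreteness, a rough sketch of what I have in mind -- the file name and the helper function are made up:)
import csv

def iter_rows(path):
    # yields one row at a time, so the full list never sits in memory
    with open(path, newline="") as f:
        for row in csv.reader(f):
            yield row

for row in iter_rows("my_list.csv"):   # "my_list.csv" is a placeholder name
    ...                                # use each row once, then let it go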
Is there a way of releasing the memory of a list as I loop over it? The following does NOT work:
>>> my_list = [1,2,3]
>>> sys.getsizeof(my_list)
80
>>> my_list.pop()
3
>>> sys.getsizeof(my_list)
80
or
>>> my_list = [1,2,3]
>>> sys.getsizeof(my_list)
80
>>> del my_list[-1]
>>> sys.getsizeof(my_list)
80
even when gc.collect() is called explicitly.
The only way that I get to work is copying the array (which at the time of copying would require 2x the memory and thus is again a problem):
>>> my_list = [1,2,3]
>>> sys.getsizeof(my_list)
80
>>> my_list.pop()
3
>>> my_list_copy = my_list.copy()
>>> sys.getsizeof(my_list_copy)
72
The fact that I don't find information on this topic indicates to me that probably the approach is either impossible or bad practice. If it should not be done this way, what would be the best alternative? Loading from csv as a generator? Or are there even better ways of doing this?
EDIT: as @Scott Hunter pointed out, the garbage collector works for much larger lists:
>>> my_list = [1] * 10**9
>>> for i in range(10):
...     for j in range(10**8):
...         del my_list[-1]
...     gc.collect()
...     print(sys.getsizeof(my_list))
Prints the following:
8000000056
8000000056
8000000056
8000000056
8000000056
4500000088
4500000088
2531250112
1423828240
56
CodePudding user response:
Many of your assumptions here are incorrect.
The first biggie is the assumption that you can delete items as you loop over them with a for loop. You can't. You could with a while loop of the form:
while my_list:
    item = my_list.pop(0)
    process(item)
    # Each my_list[0] element's ref_count goes down by one each loop...
    # If the ref_count reaches 0, the item is garbage collected automatically.
    # But this is very slow with a Python list.
    # Use a collections.deque instead.
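For concreteness, here is a minimal sketch of that pattern with a deque -- process is just a placeholder, and my_list is assumed to already exist:
from collections import deque

def process(item):
    pass   # placeholder for the real per-item work

dq = deque(my_list)   # copies only the references, not the elements themselves
del my_list           # drop the original list so its buffer can be freed
while dq:
    item = dq.popleft()   # O(1), unlike list.pop(0) which is O(n)
    process(item)
    # once item is rebound on the next iteration, its ref count can reach 0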
The garbage collector in Python is virtually seamless and automatic. It is rare to ever need to call the garbage collector explicitly in programs written with common programming patterns. I can't think of any time it was needed in my own use.
To answer one of your questions: if you call .pop on a list, the object popped off the list has its reference count decreased. If it reaches zero, that object is garbage collected automatically -- there is no need to call gc.collect().
The main use of calling gc.collect() is to clean up self-referring objects that the reference-counting collector cannot handle on its own (e.g. a = []; a.append(a); del a. In this case, the ref count for a never reaches 0, so a is never freed by reference counting alone; a small illustration follows below).
Python (depending on the implementation) allocates and frees memory in blocks far larger than the individual objects usually stored in lists. Each .append to or .pop from a list either uses that shared heap or triggers a new allocation or release. If you try to do per-item memory management -- unless each item is huge -- it will be far less efficient than Python's automatic memory management.
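A small illustration of that cycle case (this is just a sketch; the exact count returned by gc.collect() can vary between runs and interpreter versions):
import gc

a = []
a.append(a)     # the list now contains a reference to itself
del a           # the ref count never reaches 0, so refcounting alone cannot free it

unreachable = gc.collect()   # the cycle detector finds and reclaims it
print(unreachable)           # typically prints a number >= 1 on CPython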
You state that you are reading a large list from a file and then using those items for something else. So long as the items are not mutated, this usually does not result in new copies of the items; it only increases the reference count of each item.
If you read a large list from a file, mutate each item and keep a copy, then indeed your memory use goes up. However, your first list is automatically deleted when it goes out of scope.
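A rough sketch of the difference (the sample strings are stand-ins for lines read from a file):
import sys

lines = ["Line_1", "Line_2"]              # pretend these came from a file
kept = [s for s in lines]                 # stores references, not copies
print(kept[0] is lines[0])                # True: same object, only the ref count grew
print(sys.getrefcount(lines[0]))          # includes the extra reference held by kept

mutated = [s.replace("_", ":") for s in lines]   # mutation-style processing builds new strings
print(mutated[0] is lines[0])             # False: these are new objects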
All these issues are moot if you process the file line-by-line or use a generator of some sort to do so.
Here is an example. Assume you want to take a big text file and 1) read every line; 2) change "_" to ":" in every line; 3) have a list with every line so processed.
Given this 1.3 GB, 100,000,000 line file:
# Create the file with awk
% awk -v cnt=100000000 'BEGIN{ for (i=1; i<=cnt; i++) print "Line_" i }' >file
% ls -lh file
-rw-r--r-- 1 dawg wheel 1.3G Nov 14 12:34 file
% printf "%s\n...\n%s\n" $(head -n 1 file) $(tail -n 1 file)
Line_1
...
Line_100000000
You can process that file several different ways.
The first is along the lines of what you were thinking about:
from collections import deque
# If you don't use a deque, this is too SLOW with a Python list
def process(dq):
rtr=deque()
while dq:
line=dq.popleft()
line=line.rstrip().replace("_", ":")
rtr.append(line)
return rtr
with open('/tmp/file') as f:
dq=deque(f.readlines())
new_dq=process(dq)
print(new_dq[-1])
The second is with a line by line generator:
def process(line):
return line.rstrip().replace("_", ":")
with open('/tmp/file') as f:
new_li=list(process(line) for line in f)
print(new_li[-1])
The second method is a) faster, b) uses less memory, and c) easier and more Pythonic to write.
Don't overthink trying to manage memory. Python is a lot easier than C.