Is it possible that python takes very long to assign a name to a huge list (~10 gb)?

I am working with a very long list of lists (about 10 GB if you write it to a file), a cleaned corpus. In my script, I assign it a name, and then use it in another function that has to do with word2vec/spacy and semantic similarity (calculating, for each word in a list of words, how semantically similar they are, i.e. how similar the contexts in which these words appear are). I have many steps in my script, and I ask it to print something after some of the steps, all to an output file. I am using bash to execute the script. It's been 3 hours, and nothing is in my output file, which I assume means that the list has not been assigned the name yet. However, when I run a .py script containing only the list (also assigned to a name), it finishes very quickly. Also, the model usually loads very quickly, so that shouldn't be the problem either. So... am I doing something wrong here? This is how I made the list (that process worked, I already have the list!), and the actual list is just a huge list of lists:

from tqdm import tqdm
import re
import nltk
import string
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
punct = string.punctuation + '«»' + '``'

def read_lines_from_big_file(path):
    with open(path, 'r', encoding='latin-1') as fp:
        for line in fp:
            if len(line) > 1:
                parts = word_tokenize(line) 
                yield parts

contexts_big = []    

            
for split_line in tqdm(read_lines_from_big_file('.../corpus.txt')):
    if 'CURRENT' not in split_line:
        clean_2 = [re.sub('\x93|\x94|\x92|l\'|un\'','',x.strip(punct).lower()) for x in split_line if re.sub('\x93|\x94|\x92|l\'|un\'','',x.strip(punct).lower()) not in stopwords.words('italian') #don't include if the word is a stopword
        and re.sub('\x93|\x94\x92|l\'|un\'','',x.strip(punct).lower()) != " " #don't include extra empty spaces
        and re.sub('\x93|\x94\x92|l\'|un\'','',x.strip(punct).lower()) not in punct #double check that all punct is removed
        and len(re.sub('\x93|\x94\x92|l\'|un\'','',x.strip(punct).lower())) > 1
        and not re.match(r'http\S+|\d+|\n|www\S+', re.sub('\x93|\x94\x92|l\'|un\'','',x.strip(punct).lower()))] #to remove any remaining urls, numbers or stray characters
        contexts_big.append(clean_2)
    else:
        continue

# contexts_big ends up looking like: [[...],[...],[...],...]

Thanks for your help!

CodePudding user response:

When you're working with huge files, you have to think about memory management. Let's take apart this sequence:

with open("...stimoli.txt", 'r') as stimoli:
    read = stimoli.read().split("\n")
    stimoli = [x.replace(" ", "") for x in read]

First, we have stimoli.read(). That's going to read the ENTIRE FILE into memory as a single string. That statement cannot proceed until the entire file is in memory. So, there's 10GB.

Next, we have .split("\n"). That will start with that single 10GB string, will search for the newlines, and create a list with lines. This list will be ANOTHER 10GB, that has to exist in memory at the same time as the first. That's 20GB of memory altogether. That's huge.

Now, we assign that 10GB list to read, so there's an outstanding reference. Next, we do: [x.replace(" ", "") for x in read] Because of the outer brackets, that has to create ANOTHER full list in memory. That list will also be 10GB.

So, that simple set of lines has created 30GB of allocations, of which 20GB will stick around when the statement exits.
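If you really need the read-everything-at-once version for some reason, one small mitigation (my sketch, not from the original code, with fp and cleaned as placeholder names) is to drop the intermediate reference as soon as the cleaned list exists. The peak is still roughly 20GB while the comprehension runs, but only about 10GB survives afterwards instead of 20GB:

with open("...stimoli.txt", 'r') as fp:
    read = fp.read().split("\n")                  # ~10GB string, then ~10GB list of lines
    cleaned = [x.replace(" ", "") for x in read]  # another ~10GB list
    del read                                      # let Python reclaim the line list right away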

Either way, compare the original approach to this:

with open("...stimoli.txt", 'r') as stimoli:
    stimoli = [x.replace(" ", "") for x in stimoli]

That iterates over the file object line by line (effectively calling readline repeatedly), so it never has to hold more than one line at a time on top of the growing result list. It will eventually build the same 10GB list of strings, but we only had to allocate about 10GB, not 30GB.
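If you want to see the difference for yourself before committing to the full 10GB run, the standard-library tracemalloc module reports peak allocation. A minimal sketch, assuming you point it at a smaller sample file (the path here is just a placeholder):

import tracemalloc

tracemalloc.start()
with open("sample_corpus.txt", 'r') as fp:
    cleaned = [x.replace(" ", "") for x in fp]    # line-by-line version
current, peak = tracemalloc.get_traced_memory()
print(f"peak allocation: {peak / 1024**2:.1f} MiB")
tracemalloc.stop()

Swap the body of the with block for the read().split("\n") version and compare the reported peaks.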

CodePudding user response:

In your original code, I can see that you repeat a set of heavy computations over and over, when you could do them once and reuse the result. What about the following approach?

for split_line in tqdm(read_lines_from_big_file('.../corpus.txt')):
    if 'CURRENT' not in split_line:
        tmp = [
            re.sub('\x93|\x94|\x92|l\'|un\'', '', x.strip(punct).lower())
            for x
            in split_line
        ]
        clean_2 = [
            x
            for x
            in tmp
            if x not in stopwords.words('italian')
                and x != " "
                and x not in punct
                and len(x) > 1
                and not re.match(r'http\S+|\d+|\n|www\S+', x)
        ]
        contexts_big.append(clean_2)
    else:
        continue

You perform the stripping, lowercasing and regex substitution once, and then validate the results. I'm not sure I got the regex right, since the two versions in your code were not identical, but they were so similar that I assumed it was a typo. This can be refined further, but I suspect that this change alone will give a noticeable speedup.
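One possible further refinement (a sketch only, reusing punct, read_lines_from_big_file and contexts_big from your question): compile the regexes once and turn the stopword list into a set, so neither is rebuilt or scanned linearly for every token:

import re
from nltk.corpus import stopwords

SUB_RE = re.compile('\x93|\x94|\x92|l\'|un\'')        # the first regex variant from your code
SKIP_RE = re.compile(r'http\S+|\d+|\n|www\S+')
ITALIAN_STOPWORDS = set(stopwords.words('italian'))   # build once; set lookup is O(1)

for split_line in tqdm(read_lines_from_big_file('.../corpus.txt')):
    if 'CURRENT' in split_line:
        continue
    tmp = [SUB_RE.sub('', x.strip(punct).lower()) for x in split_line]
    clean_2 = [
        x
        for x in tmp
        if x not in ITALIAN_STOPWORDS
            and x != " "
            and x not in punct
            and len(x) > 1
            and not SKIP_RE.match(x)
    ]
    contexts_big.append(clean_2)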
