I am trying to optimize a function that imports .txt files for data analysis. I have a somewhat large corpus, but given that the function only reads the documents and builds a DataFrame with each paragraph as an element, I think it is taking WAY too long: around 30 minutes for 17 documents of about 1000 pages each.
Any suggestions on how to make this faster? I only have to load the data once, but it's annoying to lose half an hour just loading it.
import os
import pandas as pd

def read_docs_paragraph(textfolder):
    """
    This function reads all the files in a folder and returns a dataframe with the
    content of the files chunked by paragraphs (if the .txt file is organized by
    paragraphs) and the name of the file.

    Parameters
    ----------
    textfolder : str
        The path of the folder where the files are located.

    Returns
    -------
    df : DataFrame
        A dataframe with the content of the files and the name of the file.
    """
    df = pd.DataFrame()
    df['Corpus'] = ''
    df['Estado'] = ''
    # Iterate over the files in the folder
    for filename in os.listdir(textfolder):
        # Only process .txt files
        if filename.endswith('.txt'):
            # Open the file inside textfolder (not just the bare filename)
            with open(os.path.join(textfolder, filename), 'r', encoding='utf8') as f:
                for line in f.readlines():
                    # Append one row per line; the DataFrame grows row by row
                    df_length = len(df)
                    df.loc[df_length] = line
                    df.loc[df_length, 'Estado'] = filename
    return df
CodePudding user response:
Here's a suggestion with Path and fileinput, both from the standard library:
from pathlib import Path
import fileinput

import pandas as pd

def read_docs_paragraph(textfolder):
    # fileinput chains every matching .txt file into one iterable of lines
    with fileinput.input(Path(textfolder).glob("*.txt")) as files:
        return pd.DataFrame(
            # files.filename() reports which file the current line came from
            ([line, files.filename()] for line in files),
            columns=["Corpus", "Estado"]
        )
I've timed it a bit and it seems to be about 700 times faster (that may depend on the files, the machine, etc., though).
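Most of that gap is likely the row-by-row growth of the DataFrame in the original: every df.loc[df_length] = ... append makes pandas reallocate and copy data. If you would rather stay close to your original structure, here is a rough sketch of the same fix expressed with plain lists that are turned into a DataFrame only once at the end (read_docs_paragraph_lists is just an illustrative name):

import os
import pandas as pd

def read_docs_paragraph_lists(textfolder):
    # Collect the rows in plain Python lists and build the DataFrame once,
    # instead of appending to it line by line.
    lines, names = [], []
    for filename in os.listdir(textfolder):
        if filename.endswith('.txt'):
            with open(os.path.join(textfolder, filename), 'r', encoding='utf8') as f:
                for line in f:
                    lines.append(line)
                    names.append(filename)
    return pd.DataFrame({'Corpus': lines, 'Estado': names})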
As pointed out by @FranciscoMelloCastro: if you have to be explicit about the encoding, you can pass openhook=fileinput.hook_encoded("utf-8"), or, starting with Python 3.10, encoding="utf-8".
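For example, a minimal sketch of both variants ("textfolder" is a placeholder path; pick whichever matches your Python version):

import fileinput
from pathlib import Path

paths = list(Path("textfolder").glob("*.txt"))  # placeholder folder name

# Python 3.10 and later: pass the encoding directly
with fileinput.input(paths, encoding="utf-8") as files:
    for line in files:
        ...  # process line

# Earlier versions: use an openhook instead
with fileinput.input(paths, openhook=fileinput.hook_encoded("utf-8")) as files:
    for line in files:
        ...  # process line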