I am trying to optimize a function that imports .txt files for data analysis. I have a somewhat large corpus, but given that the function only reads the documents and builds a DataFrame with each paragraph as an element, I think it is taking WAY too long: around 30 minutes for 17 documents of about 1000 pages each.
Any suggestions on how to make this faster? I only have to load the data once, but it's annoying to lose half an hour just loading it.
import os
import pandas as pd

def read_docs_paragraph(textfolder):
    """
    This function reads all the files in a folder and returns a dataframe with the
    content of the files chunked by paragraphs (if the .txt file is organized by
    paragraphs) and the name of the file.

    Parameters
    ----------
    textfolder : str
        The path of the folder where the files are located.

    Returns
    -------
    df : DataFrame
        A dataframe with the content of the files and the name of the file.
    """
    df = pd.DataFrame()
    df['Corpus'] = ''
    df['Estado'] = ''
    # Iterate over the files in the folder
    for filename in os.listdir(textfolder):
        # Only process .txt files
        if filename.endswith('.txt'):
            # Open the file inside textfolder (not just the bare filename)
            with open(os.path.join(textfolder, filename), 'r', encoding='utf8') as f:
                for line in f.readlines():
                    # Append one row per line; the DataFrame grows row by row
                    df_length = len(df)
                    df.loc[df_length] = line
                    df.loc[df_length, 'Estado'] = filename
    return df
CodePudding user response:
Here's a suggestion with Path and fileinput, both from the standard library:
from pathlib import Path
import fileinput

import pandas as pd

def read_docs_paragraph(textfolder):
    # fileinput chains every matching .txt file into one iterable of lines
    with fileinput.input(Path(textfolder).glob("*.txt")) as files:
        return pd.DataFrame(
            # files.filename() reports which file the current line came from
            ([line, files.filename()] for line in files),
            columns=["Corpus", "Estado"]
        )
I've timed it a bit and it seems to be about 700 times faster (that may depend on the files, the machine, etc., though).
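Most of that gap is likely the row-by-row growth of the DataFrame in the original: every df.loc[df_length] = ... append makes pandas reallocate and copy data. If you would rather stay close to your original structure, here is a rough sketch of the same fix expressed with plain lists that are turned into a DataFrame only once at the end (read_docs_paragraph_lists is just an illustrative name):

import os
import pandas as pd

def read_docs_paragraph_lists(textfolder):
    # Collect the rows in plain Python lists and build the DataFrame once,
    # instead of appending to it line by line.
    lines, names = [], []
    for filename in os.listdir(textfolder):
        if filename.endswith('.txt'):
            with open(os.path.join(textfolder, filename), 'r', encoding='utf8') as f:
                for line in f:
                    lines.append(line)
                    names.append(filename)
    return pd.DataFrame({'Corpus': lines, 'Estado': names})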
As pointed out by @FranciscoMelloCastro: if you have to be explicit about the encoding, you can pass openhook=fileinput.hook_encoded("utf-8"), or, starting with Python 3.10, encoding="utf-8".
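For example, a minimal sketch of both variants ("textfolder" is a placeholder path; pick whichever matches your Python version):

import fileinput
from pathlib import Path

paths = list(Path("textfolder").glob("*.txt"))  # placeholder folder name

# Python 3.10 and later: pass the encoding directly
with fileinput.input(paths, encoding="utf-8") as files:
    for line in files:
        ...  # process line

# Earlier versions: use an openhook instead
with fileinput.input(paths, openhook=fileinput.hook_encoded("utf-8")) as files:
    for line in files:
        ...  # process line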