I need to import a text file with 1M lines into a pandas DataFrame and then split each line by position. Here is an example of what the data looks like:
Corno 616600000071835014010820220108202201201082022100000054700000020000000000 0000000000192 010 150 0100090076002000220000004038 00000000000000002934 EUR00000000000293 EUR00000000003227 EUR000000000000 KG
What I'm doing right now is decoding each line and then splitting it into a Python list:
data_t = []  # rows accumulate here; the DataFrame is built once at the end
with open('temp.txt', 'rb') as f:          # open in binary mode so .decode() works
    for line in f:
        dec = line.decode("cp1252")
        temp = [dec[0:6], dec[6:12], dec[12:14], dec[14:22], dec[22:30], dec[30:38], dec[38:39],
                dec[39:41], dec[41:49], dec[49:52], dec[52:60], dec[60:62], dec[62:70], dec[70:82],
                dec[82:84], dec[84:92], dec[92:104], dec[104:108], dec[108:126], dec[126:144],
                dec[144:152], dec[152:153], dec[153:213], negozio]   # negozio is defined elsewhere
        data_t.append(temp)
and then creating a DataFrame from the list:
import pandas as pd

df_total_testate = pd.DataFrame(data_t,
                                columns=[f'Column{i}' for i in range(1, 25)])
But this way it takes over 20 mins.
Is there any way to make it faster?
EDIT: I updated the code with more details.
CodePudding user response:
You should use pandas' built-in functions, as they are highly optimised. Also, you should avoid pd.DataFrame.append. As the docs mention:
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
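In practice, the pattern the docs describe looks roughly like this (a minimal sketch with hypothetical names, records and cols being placeholder data):

    import pandas as pd

    records = [["a", 1], ["b", 2], ["c", 3]]   # hypothetical parsed rows
    cols = ["code", "value"]                   # hypothetical column names

    # Slow pattern: growing the DataFrame one row at a time copies it on every iteration
    # df = pd.DataFrame(columns=cols)
    # for rec in records:
    #     df = pd.concat([df, pd.DataFrame([rec], columns=cols)])

    # Recommended pattern: accumulate rows in a plain list, build the frame once
    rows = []
    for rec in records:
        rows.append(rec)
    df = pd.DataFrame(rows, columns=cols)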
My advice is thus: either load the data with pd.read_csv(filename, sep=r"\s+", names=[f"Column{i}" for i in range(1, 25)]) (you might also try other delimiters instead of \s+), or use pd.read_fwf, as mentioned by Алексей Р in the comments.
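A minimal sketch of the read_fwf approach, assuming the fixed-width boundaries taken from the slicing code in your question (the negozio value is added as a separate column afterwards, since it does not come from the file):

    import pandas as pd

    # Column boundaries taken from the slicing in the question
    colspecs = [(0, 6), (6, 12), (12, 14), (14, 22), (22, 30), (30, 38), (38, 39),
                (39, 41), (41, 49), (49, 52), (52, 60), (60, 62), (62, 70), (70, 82),
                (82, 84), (84, 92), (92, 104), (104, 108), (108, 126), (126, 144),
                (144, 152), (152, 153), (153, 213)]

    df_total_testate = pd.read_fwf('temp.txt',
                                   colspecs=colspecs,
                                   names=[f'Column{i}' for i in range(1, 24)],
                                   encoding='cp1252',
                                   dtype=str)   # keep every field as text, like the manual slicing did

    # The 24th column came from a variable outside the loop in the original code
    df_total_testate['Column24'] = negozio

This reads and splits the whole file in a single vectorised call instead of a Python-level loop, which is where most of the 20 minutes is being spent.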