Is there a faster way to create a df from a txt file?

I have a .txt file with lines such as "G1 X174.774 Y46.362 E1.48236", "M73 Q1 S245", all with one letter then a number and then a space. I'm trying to create a dataframe such that each row is a line from my file and each column is a letter. If my file were just the two lines above, my resulting dataframe would be

G X       Y      E       M  Q S
1 174.774 46.362 1.48236 0  0 0
0 0       0      0       73 1 245

So far I have a dataframe with the columns of all possible letters in the .txt file, and the .txt file is now represented as a list of strings representing each line of the file. As of now I can only figure out how to add each line individually to the df with the following for loop:

for j in tqdm(range(len(lines))):
    line = lines[j]
    points = line.split()
    k = [x[0] for x in points]
    v = [x[1:] for x in points]
    line_dict = dict(zip(k, v))
    df.loc[j] = pd.Series(line_dict)

This gives me my desired result (the unspecified values are NaN, but I can change these to zero later), but as my files have 200k lines, it's taking about an hour per file. Is there a faster way I could do this? I've been trying to think of a way to use list comprehension, but using the dict is confusing me a bit, and I'm not sure how much faster that would make things anyway. I haven't been able to find much on stackoverflow about this subject, but if I missed something please feel free to share the link with me! Thanks!

CodePudding user response：

Yes, I suspect there is. Do not incrementally increase the number of rows in a dataframe in a loop:

df.loc[j] = pd.Series(line_dict)

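Each assignment to a new row label enlarges the dataframe, which means the existing data gets copied again and again. Over the whole loop this results in quadratic time complexity.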

Instead, accumulate those dicts into a list, then create a pandas dataframe from that list at the very end. So:

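import pandas as pd
from tqdm import tqdm

data = []
for line in tqdm(lines):  # iterate over the lines themselves, not range(...)
    points = line.split()
    k = [x[0] for x in points]   # first character of each token is the column letter
    v = [x[1:] for x in points]  # the rest of the token is the value (still a string)
    line_dict = dict(zip(k, v))
    data.append(line_dict)

# build the dataframe once, after the loop
df = pd.DataFrame(data)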

The above should be linear time.
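
Since the question also asks about a list comprehension, here is a minimal sketch of the same idea written that way, using the two sample lines from the question. Converting the values with float() and filling the gaps with 0 are my assumptions about what the final frame should look like; the name records is just illustrative.

```python
import pandas as pd

# Two sample lines from the question; a real run would use the list read from the file.
lines = ["G1 X174.774 Y46.362 E1.48236", "M73 Q1 S245"]

# One dict per line: the first character of each token is the column name,
# the remainder is parsed as a number (assumes every token is letter + number).
records = [{tok[0]: float(tok[1:]) for tok in line.split()} for line in lines]

# Build the dataframe once, then replace the missing entries (NaN) with 0.
df = pd.DataFrame(records).fillna(0)
print(df)
```

The dataframe is still constructed a single time, so this keeps the linear behaviour; the comprehension only makes it shorter.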

CodePudding user response：

Specifying the sep parameter in pandas.read_csv could be a good idea. If the separator is a space, the dataframe could be constructed as follows:

import pandas as pd
df = pd.read_csv('file.txt', sep=' ')
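
Note that splitting on spaces alone leaves each token (for example X174.774) whole and in a positional column, so it does not by itself give one column per letter. If you want to stay inside pandas, one possible route is Series.str.extractall followed by a pivot. This is only a sketch: the regex assumes every token is a single letter followed by a plain decimal number, and that no letter repeats within a line (pivot would raise on duplicates).

```python
import pandas as pd

# Sample lines from the question; in practice these would come from the file.
lines = ["G1 X174.774 Y46.362 E1.48236", "M73 Q1 S245"]
s = pd.Series(lines)

# Pull out every (letter, number) pair; the result has one row per match,
# indexed by (original line number, match number).
pairs = s.str.extractall(r"(?P<letter>[A-Za-z])(?P<value>-?\d*\.?\d+)")
pairs["value"] = pairs["value"].astype(float)

# Turn the long table into one row per line and one column per letter.
wide = (
    pairs.reset_index()  # the unnamed line-number level becomes "level_0"
    .pivot(index="level_0", columns="letter", values="value")
    .reindex(s.index)    # keep lines that had no matches at all
    .fillna(0)
)
print(wide)
```

Whether this beats the list-of-dicts approach on a 200k-line file is something you would have to time on your own data.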