I have a .txt file with lines such as "G1 X174.774 Y46.362 E1.48236", "M73 Q1 S245", all with one letter then a number and then a space. I'm trying to create a dataframe such that each row is a line from my file and each column is a letter. If my file were just the two lines above, my resulting dataframe would be
G X Y E M Q S
1 174.774 46.362 1.48236 0 0 0
0 0 0 0 73 1 245
So far I have a dataframe with the columns of all possible letters in the .txt file, and the .txt file is now represented as a list of strings representing each line of the file. As of now I can only figure out how to add each line individually to the df with the following for loop:
for j in tqdm(range(len(lines))):
    line = lines[j]
    points = line.split()
    k = [x[0] for x in points]
    v = [x[1:] for x in points]
    line_dict = dict(zip(k, v))
    df.loc[j] = pd.Series(line_dict)
This gives me my desired result (the unspecified values are NaN, but I can change these to zero later), but as my files have 200k lines, it's taking about an hour per file. Is there a faster way I could do this? I've been trying to think of a way to use list comprehension, but using the dict is confusing me a bit, and I'm not sure how much faster that would make things anyway. I haven't been able to find much on stackoverflow about this subject, but if I missed something please feel free to share the link with me! Thanks!
CodePudding user response:
Yes, I suspect there is. Do not incrementally increase the number of rows in a dataframe in a loop:
df.loc[j] = pd.Series(line_dict)
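This will result in quadratic time complexity, because each newly added row can force pandas to reallocate and copy the data already in the dataframe.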
Instead, accumulate those dicts into a list, then create a pandas dataframe from that list at the very end. So:
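data = []
# Iterate over the lines themselves; range(lines) would raise a TypeError on a list
for line in tqdm(lines):
    points = line.split()
    k = [x[0] for x in points]   # leading letter of each token
    v = [x[1:] for x in points]  # remainder of each token (still a string)
    line_dict = dict(zip(k, v))
    data.append(line_dict)
# Build the dataframe once, after the loop
df = pd.DataFrame(data)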
The above should be linear time.
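As a compact variant of the same idea, here is a minimal sketch that builds the list of dicts with a comprehension, converts the values to floats, and replaces the missing entries with zero. The function name parse_gcode_lines is made up for illustration, and the sketch assumes every token is a single letter followed by something float() can parse.

```python
import pandas as pd


def parse_gcode_lines(lines):
    """Return a DataFrame with one row per line and one column per letter.

    Assumes each whitespace-separated token is one letter followed by a number.
    """
    # One dict per line: the first character is the key, the rest is the value.
    records = [
        {token[0]: float(token[1:]) for token in line.split()}
        for line in lines
    ]
    # Letters that do not occur on a line come out as NaN; replace them with 0.
    return pd.DataFrame(records).fillna(0)


lines = ["G1 X174.774 Y46.362 E1.48236", "M73 Q1 S245"]
df = parse_gcode_lines(lines)
print(df)
```

The dataframe is still created only once, so the cost stays proportional to the number of lines.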
CodePudding user response:
Specifying the sep parameter in pandas.read_csv could be a good idea. If the separator is a space, the dataframe could be constructed as follows:
import pandas as pd
df = pd.read_csv('file.txt', sep=' ')