I have a massive raw data file info.txt:
1 A B F I
2 R S
3 D K L O Q T
...
999998 C X Y Z
999999 D E L
1000000 A Z
consisting of one int
followed by one or more str
values, all separated by a blank space. I want to retrieve a pandas
DataFrame df
that has the following format:
number letters
1 [A, B, F, I]
2 [R, S]
3 [D, K, L, O, Q, T]
...
999998 [C, X, Y, Z]
999999 [D, E, L]
1000000 [A, Z]
What I have performed is the following:
df = pd.read_csv(r"../info.txt", header=None)
df.columns = ['number']
df[['number', 'letters']] = df["number"].str.split(" ", n=1, expand=True)
for i in range(len(df)):
    df['letters'][i] = df['letters'][i].split()
And this does the trick. The problem is that, due to the file's size, this takes a little under two minutes. Most of that time is spent in the for loop, which converts each row's letters string into a list. Is there a more efficient way to format the data, maybe even directly while reading info.txt?
(I have explored previous questions such as this and this, but I couldn't quite apply them to my data.)
CodePudding user response:
This can actually be optimized by just reading the file's contents with plain Python primitives. I suspect read_csv attempts a bunch of inference and data-type magic that isn't required for plain data like this.
import pandas as pd

def so_me():
    numbers = []
    letters = []
    with open('./so71567951-data.txt') as f:
        for line in f:
            # Split on whitespace: the first token is the number,
            # the remaining tokens are the letters.
            n, *l = line.rstrip().split()
            numbers.append(int(n))
            letters.append(list(l))
    return pd.DataFrame({'number': numbers, 'letters': letters})
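As a quick sanity check, the same approach can be exercised on a tiny temporary file (the helper below takes the path as a parameter, unlike so_me, which hard-codes it):

```python
import os
import tempfile

import pandas as pd

def read_plain(path):
    # Same technique as so_me above, parameterized on the file path.
    numbers, letters = [], []
    with open(path) as f:
        for line in f:
            n, *l = line.rstrip().split()
            numbers.append(int(n))
            letters.append(l)
    return pd.DataFrame({'number': numbers, 'letters': letters})

# Write two sample rows in the info.txt format, then read them back.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('1 A B F I\n2 R S\n')
    path = f.name

df = read_plain(path)
os.unlink(path)
print(df)
```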
This seems to be 50%-60% faster than the other solution (71568114).
I ran out of patience testing the original against the bigger datasets, but on a 100,000-line dataset this is about 43x faster than the original.
# 5 million lines
name='so_71568114' iters=1 time=9.284 iters_per_sec=0.11
name='so_me' iters=1 time=6.065 iters_per_sec=0.16
# 1.5 million lines
name='so_71568114' iters=1 time=2.911 iters_per_sec=0.34
name='so_me' iters=1 time=1.766 iters_per_sec=0.57
# 100,000 lines
name='so_original' iters=1 time=11.684 iters_per_sec=0.09
name='so_71568114' iters=2 time=0.359 iters_per_sec=5.58
name='so_me' iters=2 time=0.241 iters_per_sec=8.30
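The timing lines above come from a small custom harness; a minimal sketch of something equivalent (the function name, signature, and repetition logic are assumptions, not the exact script used):

```python
import time

def bench(name, fn, min_time=1.0):
    # Run fn repeatedly until at least min_time seconds have elapsed,
    # then report the iteration count and iterations per second.
    iters = 0
    start = time.perf_counter()
    while True:
        fn()
        iters += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_time:
            break
    print(f"{name=} {iters=} time={elapsed:.3f} iters_per_sec={iters / elapsed:.2f}")
    return iters, elapsed
```

Slow functions get a single iteration (iters=1), matching the larger datasets above.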
CodePudding user response:
Let us optimize your code:
# Read each line as a single string, split it on whitespace, and use the
# .str accessor to take the first token and the remaining tokens.
s = pd.read_csv('../info.txt', header=None)[0].str.split().str
df_out = pd.DataFrame({'number': s[0], 'letters': s[1:]})
print(df_out)
number letters
0 1 [A, B, F, I]
1 2 [R, S]
2 3 [D, K, L, O, Q, T]
3 999998 [C, X, Y, Z]
4 999999 [D, E, L]
5 1000000 [A, Z]
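The trick here is that the .str accessor also works element-wise on a Series of lists, so indexing it takes the first element of each list and slicing it takes the rest:

```python
import pandas as pd

# Split each row's string into a list of tokens.
s = pd.Series(['1 A B', '2 C']).str.split()

first = s.str[0]   # first token of each row: '1', '2'
rest = s.str[1:]   # remaining tokens of each row as lists
print(first.tolist())
print(rest.tolist())
```

Note that the first tokens come back as strings, so the number column in df_out above has object dtype; add .astype(int) if you need integers.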