I have a massive raw data file info.txt:
1 A B F I
2 R S
3 D K L O Q T
...
999998 C X Y Z
999999 D E L
1000000 A Z
consisting of one int
followed by one or more str
values, all separated by a blank space. I want to retrieve a pandas
DataFrame df
that has the following format:
number letters
1 [A, B, F, I]
2 [R, S]
3 [D, K, L, O, Q, T]
...
999998 [C, X, Y, Z]
999999 [D, E, L]
1000000 [A, Z]
What I have performed is the following:
df = pd.read_csv(r"../info.txt", header=None)
df.columns = ['number']
df[['number', 'letters']] = df["number"].str.split(" ", n=1, expand=True)
for i in range(len(df)):
    df['letters'][i] = df['letters'][i].split()
And this does the trick. The problem is that, due to the file's size, this takes a little under two minutes. Most of that time is spent in the for loop, which converts each row's letters string into a list. Is there a more efficient way to format the data, maybe even directly while reading info.txt?
(I have explored previous questions such as this and this, but I couldn't quite apply them to my data.)
CodePudding user response:
This can actually be optimized by just reading the file's contents with plain Python primitives. I suspect read_csv attempts a bunch of inference and data-type magic that isn't required for plain data like this.
import pandas as pd

def so_me():
    numbers = []
    letters = []
    with open('./so71567951-data.txt') as f:
        for line in f:
            # Split on whitespace: the first token is the number,
            # the remaining tokens are the letters.
            n, *l = line.rstrip().split()
            numbers.append(int(n))
            letters.append(list(l))
    return pd.DataFrame({'number': numbers, 'letters': letters})
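As a quick sanity check, the same approach can be exercised on a tiny temporary file (the helper below takes the path as a parameter, unlike so_me, which hard-codes it):

```python
import os
import tempfile

import pandas as pd

def read_plain(path):
    # Same technique as so_me above, parameterized on the file path.
    numbers, letters = [], []
    with open(path) as f:
        for line in f:
            n, *l = line.rstrip().split()
            numbers.append(int(n))
            letters.append(l)
    return pd.DataFrame({'number': numbers, 'letters': letters})

# Write two sample rows in the info.txt format, then read them back.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('1 A B F I\n2 R S\n')
    path = f.name

df = read_plain(path)
os.unlink(path)
print(df)
```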
This seems to be 50%-60% faster than the other solution (71568114).
I ran out of patience testing the original against the bigger datasets, but on a 100,000-line dataset this is about 43x faster than the original.
# 5 million lines
name='so_71568114' iters=1 time=9.284 iters_per_sec=0.11
name='so_me' iters=1 time=6.065 iters_per_sec=0.16
# 1.5 million lines
name='so_71568114' iters=1 time=2.911 iters_per_sec=0.34
name='so_me' iters=1 time=1.766 iters_per_sec=0.57
# 100,000 lines
name='so_original' iters=1 time=11.684 iters_per_sec=0.09
name='so_71568114' iters=2 time=0.359 iters_per_sec=5.58
name='so_me' iters=2 time=0.241 iters_per_sec=8.30
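The timing lines above come from a small custom harness; a minimal sketch of something equivalent (the function name, signature, and repetition logic are assumptions, not the exact script used):

```python
import time

def bench(name, fn, min_time=1.0):
    # Run fn repeatedly until at least min_time seconds have elapsed,
    # then report the iteration count and iterations per second.
    iters = 0
    start = time.perf_counter()
    while True:
        fn()
        iters += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_time:
            break
    print(f"{name=} {iters=} time={elapsed:.3f} iters_per_sec={iters / elapsed:.2f}")
    return iters, elapsed
```

Slow functions get a single iteration (iters=1), matching the larger datasets above.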
CodePudding user response:
Let us optimize your code:
# Read each line as a single string, split it on whitespace, and use the
# .str accessor to take the first token and the remaining tokens.
s = pd.read_csv('../info.txt', header=None)[0].str.split().str
df_out = pd.DataFrame({'number': s[0], 'letters': s[1:]})
print(df_out)
number letters
0 1 [A, B, F, I]
1 2 [R, S]
2 3 [D, K, L, O, Q, T]
3 999998 [C, X, Y, Z]
4 999999 [D, E, L]
5 1000000 [A, Z]
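The trick here is that the .str accessor also works element-wise on a Series of lists, so indexing it takes the first element of each list and slicing it takes the rest:

```python
import pandas as pd

# Split each row's string into a list of tokens.
s = pd.Series(['1 A B', '2 C']).str.split()

first = s.str[0]   # first token of each row: '1', '2'
rest = s.str[1:]   # remaining tokens of each row as lists
print(first.tolist())
print(rest.tolist())
```

Note that the first tokens come back as strings, so the number column in df_out above has object dtype; add .astype(int) if you need integers.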