I'm using pandas to read a large file, so I use:
for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):
and for each chunk I need to add a column 'seqnum' that acts as a consistent running index across all chunks:
for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):
    df_small['seqnum'] = df_small.index.values
So for the first chunk, df_small['seqnum'] will be:
0
1
2
...
999
But the df_small['seqnum'] of the second chunk will still be:
0
1
2
...
999
That is not what I want; the ideal df_small['seqnum'] for the second chunk would be:
1000
1001
1002
...
1999
Is there any way to do that?
CodePudding user response:
Use the index of df_small: when reading with chunksize, pandas continues the default RangeIndex across chunks, so the index already provides the running row number:
import pandas as pd

for df_small in pd.read_csv("data1.csv", chunksize=3,
                            iterator=True, low_memory=False):
    df_small['seqnum'] = df_small.index.values
    print(df_small)
Output:
Name seqnum # <- 1st iteration
0 A 0
1 B 1
2 C 2
Name seqnum # <- 2nd iteration
3 D 3
4 E 4
5 F 5
Name seqnum # <- 3rd iteration
6 G 6
7 H 7
8 I 8
Name seqnum # <- 4th iteration
9 J 9
10 K 10
11 L 11
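For reference, the output above assumes a small sample file data1.csv with a single Name column holding the letters A through L (twelve rows, read in chunks of three). A minimal sketch to recreate such a file; the column name and values are inferred from the printed output, not from the original post:

import pandas as pd

# hypothetical sample data matching the printed output above
pd.DataFrame({"Name": list("ABCDEFGHIJKL")}).to_csv("data1.csv", index=False)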
CodePudding user response:
Just create a variable to track the start index of the next chunk as follows:
import pandas as pd

seq_num = 0
for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):
    # offset this chunk's restarted 0-based index by the rows already processed
    df_small['seqnum'] = df_small.index + seq_num
    seq_num += len(df_small)  # advance the offset for the next chunk
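If the chunk's index is not a plain 0-based RangeIndex (for example, when index_col is set in read_csv), numbering rows by position avoids relying on index values entirely; a minimal sketch under that assumption:

import pandas as pd

seq_num = 0
for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):
    # number rows by their position in the file, independent of the index
    df_small['seqnum'] = range(seq_num, seq_num + len(df_small))
    seq_num += len(df_small)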