I'm using pandas to read a large file, so I use:
for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):
and for each chunk I need to add a column 'seqnum' that acts as a consistent running index across all chunks:
for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):
    df_small['seqnum'] = df_small.index.values
So for the first chunk, df_small['seqnum'] will be:
0
1
2
...
999
But the df_small['seqnum'] of the second chunk will still be:
0
1
2
...
999
That is not what I want; the ideal df_small['seqnum'] for the second chunk would be:
1000
1001
1002
...
1999
Is there any way to do that?
CodePudding user response:
Use the index of df_small: when reading with chunksize, pandas continues the default RangeIndex across chunks, so the index already provides the running row number:
import pandas as pd

for df_small in pd.read_csv("data1.csv", chunksize=3,
                            iterator=True, low_memory=False):
    df_small['seqnum'] = df_small.index.values
    print(df_small)
Output:
Name seqnum # <- 1st iteration
0 A 0
1 B 1
2 C 2
Name seqnum # <- 2nd iteration
3 D 3
4 E 4
5 F 5
Name seqnum # <- 3rd iteration
6 G 6
7 H 7
8 I 8
Name seqnum # <- 4th iteration
9 J 9
10 K 10
11 L 11
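For reference, the output above assumes a small sample file data1.csv with a single Name column holding the letters A through L (twelve rows, read in chunks of three). A minimal sketch to recreate such a file; the column name and values are inferred from the printed output, not from the original post:

import pandas as pd

# hypothetical sample data matching the printed output above
pd.DataFrame({"Name": list("ABCDEFGHIJKL")}).to_csv("data1.csv", index=False)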
CodePudding user response:
Just create a variable to track the start index of the next chunk as follows:
import pandas as pd

seq_num = 0
for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):
    # offset this chunk's restarted 0-based index by the rows already processed
    df_small['seqnum'] = df_small.index + seq_num
    seq_num += len(df_small)  # advance the offset for the next chunk
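If the chunk's index is not a plain 0-based RangeIndex (for example, when index_col is set in read_csv), numbering rows by position avoids relying on index values entirely; a minimal sketch under that assumption:

import pandas as pd

seq_num = 0
for df_small in pd.read_csv("largefile.txt", chunksize=1000,
                            iterator=True, low_memory=False):
    # number rows by their position in the file, independent of the index
    df_small['seqnum'] = range(seq_num, seq_num + len(df_small))
    seq_num += len(df_small)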