Processing dataframe in chunks

I need to process a large dataframe in chunks and I applied this function:

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

for i in chunker(df,chunk_size):
      ....

However, when I run this, I get an error: ValueError: invalid literal for int() with base 10:

Is there another way to process a dataframe in chunks, or a way to adjust the script above?

Thanks!

CodePudding user response：

You need to use iloc to slice over the rows by position:

def chunker(seq, size):
    return (seq.iloc[pos:pos + size] for pos in range(0, len(seq), size))

for i in chunker(df,chunk_size):
      ....

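The reason is that df[] is mainly for looking up columns. It does accept a slice and then selects rows, but whether that slice is read as positions or as labels can depend on the index, so it is easy to get wrong. df.loc is for label-based row lookups, which do not necessarily match incremental (position-based) indexing, whereas df.iloc is always position-based. You can read this for a more detailed explanation.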
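
To illustrate the point about position-based slicing, here is a minimal sketch, assuming pandas is installed. The example frame, its column name and its string index are made up for illustration; they are not from the question. It only shows that iloc chunks the rows by position regardless of what the index labels are.

```python
import pandas as pd

def chunker(df, size):
    """Yield consecutive row chunks of at most `size` rows, by position."""
    for pos in range(0, len(df), size):
        yield df.iloc[pos:pos + size]

# Hypothetical frame with a non-default (string) index
df = pd.DataFrame({"value": range(7)}, index=list("abcdefg"))

chunks = list(chunker(df, 3))
chunk_lengths = [len(c) for c in chunks]  # the last chunk may be shorter
```

Concatenating the chunks with pd.concat(chunks) should give back the original frame, which is a quick way to check that no rows were dropped or duplicated.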

CodePudding user response：

Thanks for your quick answer. I tried again following your recommendation, but I still get the error:

ValueError: invalid literal for int() with base 10: 'xxx'

If I run the same code on a different dataframe, built using np.random.randn, it works fine. Also, the error does not occur if I do not split the dataframe into chunks. Any clue?