Python pandas: how does chunksize work?

Time:05-21

I have the following code:

from numpy import dtype
import pandas as pd
import os
import sys

inputFile='data.json'
chunks = pd.read_json(inputFile, lines=True, chunksize = 1000)
original_stdout = sys.stdout

i = 1

for c in chunks:
    location = c.location.str.split(',')
    for b in range(1000):
        print(location[b])
        if not type(location[b]) == float:
            # get the country name
            country = location[b][-1]
        else:
            country = 'unknown'

I'm extracting the location field from a large file containing JSON objects. Because the file is so large, I read it in chunks of 1000 lines. I loop over each chunk and retrieve the information I need:

for c in chunks:
    location = c.location.str.split(',')
    for b in range(1000):
        print(location[b])

All goes smoothly during the first iteration. At the second iteration the line:

print(location[b])

gives the error:

ValueError: 0 is not in range

How do I cycle through the chunks after the first one?

Thank you for your help

CodePudding user response:

The problem is that location[b] looks up the location Series by index label, not by position (that is, it asks for the row whose index value is b). The chunks keep a continuous index across the file: the first chunk's index starts at 0, the second's at 1000, and so on. Label 0 therefore exists only in the first chunk, so location[0] fails in every later chunk.
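
To see the continuous index for yourself, here is a minimal sketch. The five-line sample data and the chunk size of 2 are made up for illustration, and an in-memory buffer stands in for the real file:

```python
import io
import pandas as pd

# Hypothetical sample data: five JSON lines, each with a "location" field.
raw = "\n".join(
    '{"location": "City%d,Country%d"}' % (n, n) for n in range(5)
)

# Collect the index labels of every chunk to see where each one starts.
chunk_indexes = []
for chunk in pd.read_json(io.StringIO(raw), lines=True, chunksize=2):
    chunk_indexes.append(list(chunk.index))

print(chunk_indexes)
```

Only the first list contains the label 0. Each later chunk starts where the previous one ended, which is why a lookup by label 0 succeeds once and then fails.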

So, instead, iterate over the values directly, without using the index:

for row in location:
    # Do something.
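
Applied to the loop from the question, this could look like the sketch below. The sample records are invented, the small chunk size is only there to force several chunks, and the isinstance check is one possible way to handle records with no location:

```python
import io
import pandas as pd

# Hypothetical sample data; the third record has no location.
raw = "\n".join([
    '{"location": "Paris,France"}',
    '{"location": "Lima,Peru"}',
    '{"location": null}',
    '{"location": "Oslo,Norway"}',
])

countries = []
for chunk in pd.read_json(io.StringIO(raw), lines=True, chunksize=2):
    location = chunk["location"].str.split(",")
    # Iterate over the values themselves, so index labels never matter.
    for row in location:
        # str.split leaves missing entries as a non-list value (NaN/None).
        country = row[-1] if isinstance(row, list) else "unknown"
        countries.append(country)

print(countries)
```

Iterating over the values also avoids the hard-coded range(1000), which would fail on a final chunk that holds fewer than 1000 rows.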

In fact, if you look at the full traceback, you will probably also see a KeyError below the ValueError.

To iterate over the Series and also get the index labels, use Series.items() (Series.iteritems() was deprecated in pandas 1.5 and removed in 2.0):

for idx, row in location.items():
    # Do something...
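
A small sketch of label-aware iteration across chunks, using Series.items(), which is available in current pandas versions. The sample data and the name labelled are assumptions for illustration only:

```python
import io
import pandas as pd

# Hypothetical sample data: four JSON lines, read two at a time.
raw = "\n".join(
    '{"location": "City%d,Country%d"}' % (n, n) for n in range(4)
)

labelled = []
for chunk in pd.read_json(io.StringIO(raw), lines=True, chunksize=2):
    location = chunk["location"].str.split(",")
    # idx is the index label, which keeps counting across chunks.
    for idx, row in location.items():
        labelled.append((idx, row[-1]))

print(labelled)
```

Because the labels continue from one chunk to the next, idx here doubles as the zero-based line number in the input.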