I have the following code:
from numpy import dtype
import pandas as pd
import os
import sys
inputFile = 'data.json'
chunks = pd.read_json(inputFile, lines=True, chunksize=1000)
original_stdout = sys.stdout
i = 1
for c in chunks:
    location = c.location.str.split(',')
    for b in range(1000):
        print(location[b])
        if not type(location[b]) == float:
            # get the country name
            country = location[b][-1]
        else:
            country = 'unknown'
I'm extracting the location field from a large file containing JSON objects. Because the file is so large, I read it in 1000-line chunks. I loop over each chunk and retrieve the information I need:
for c in chunks:
    location = c.location.str.split(',')
    for b in range(1000):
        print(location[b])
All goes smoothly during the first iteration. At the second iteration the line:
print(location[b])
gives the error:
ValueError: 0 is not in range
How do I loop through the chunks after the first one?
Thank you for your help
CodePudding user response:
The problem is that location[b] accesses the location Series by index label, not by position: you are asking for the row whose index value is b. The chunks keep the index of the full file, so the first chunk's index starts at 0, the second's at 1000, and so on. Index 0 is therefore only present in the first chunk.
So, instead, you need to iterate over the values without using the index:
for row in location:
    # Do something.
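As a rough illustration of the point above, the following sketch reads a tiny in-memory stand-in for the file in chunks of two lines and records where each chunk's index starts. The sample data, the chunk size of 2, and the names raw, starts, and last_countries are assumptions made up for this example; only the location field comes from the question.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.json: five JSON lines with a location field.
raw = "\n".join(
    '{"location": "City%d, Country%d"}' % (n, n) for n in range(5)
)

starts = []
last_countries = []
for chunk in pd.read_json(io.StringIO(raw), lines=True, chunksize=2):
    location = chunk["location"].str.split(",")
    # The index labels continue from the previous chunk instead of restarting.
    starts.append(int(location.index[0]))
    # Iterating the Series directly yields the values, whatever the labels are.
    for parts in location:
        last_countries.append(parts[-1].strip())

print(starts)
print(last_countries)
```

Because the inner loop never looks up a label, it behaves the same way in every chunk.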
In fact, if you look at the full traceback of the error, you will probably also see a KeyError below the ValueError.
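The failing lookup can be reproduced in isolation. This is a minimal sketch, assuming a hand-built Series whose labels start at 1000 to mimic the second chunk; the sample values are invented for the example. It also shows .iloc, which looks up by position and so works regardless of the labels.

```python
import pandas as pd

# Hypothetical second chunk: labels run from 1000 upward, as with chunksize=1000.
location = pd.Series(
    ["Rome, Italy", "Oslo, Norway"],
    index=pd.RangeIndex(1000, 1002),
)

try:
    location[0]  # label lookup: there is no row labelled 0 in this chunk
    error = None
except KeyError as exc:
    error = exc

# Depending on the pandas version, the cause may be the underlying ValueError.
print(repr(error), repr(error.__cause__))

first_by_position = location.iloc[0]  # positional lookup works in any chunk
first_by_label = location[1000]       # label lookup works with the right label
print(first_by_position, first_by_label)
```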
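To iterate over the Series and also get the index, you can use Series.items() (Series.iteritems() did the same thing, but it was deprecated in pandas 1.5 and removed in 2.0):
for idx, row in location.items():
    # Do something...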
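Putting the pieces together, here is one possible version of the loop from the question, written with Series.items(), which replaces iteritems() in current pandas. It is a sketch, not the only way to do it: the helper name extract_countries and the three sample lines are hypothetical, and it assumes that a missing location shows up as a non-list value after the split.

```python
import io
import pandas as pd

def extract_countries(chunks):
    """Map each row label to the last comma-separated part of its location."""
    countries = {}
    for chunk in chunks:
        location = chunk["location"].str.split(",")
        for idx, parts in location.items():
            # A missing location is not split into a list, so treat it as unknown.
            if isinstance(parts, list):
                countries[idx] = parts[-1].strip()
            else:
                countries[idx] = "unknown"
    return countries

# Hypothetical sample standing in for data.json, read two lines at a time.
raw = (
    '{"location": "Paris, France"}\n'
    '{"location": null}\n'
    '{"location": "Lima, Peru"}\n'
)
chunks = pd.read_json(io.StringIO(raw), lines=True, chunksize=2)
result = extract_countries(chunks)
print(result)
```

The keys of the result are the row labels of the full file, so rows from later chunks keep their original line positions.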