I have thousands of Amazon product reviews in a JSON file. I need to process the data in Python and extract the fields "reviewText", "overall", and "summary".
The JSON file looks like this:
{"reviewerID": "A11N155CW1UV02", "asin": "B000H00VBQ", "reviewerName": "AdrianaM", "helpful": [0, 0], "reviewText": "I had big expectations because I love English TV, in particular Investigative and detective stuff but this guy is really boring. It didn't appeal to me at all.", "overall": 2.0, "summary": "A little bit boring for me", "unixReviewTime": 1399075200, "reviewTime": "05 3, 2014"}
{"reviewerID": "A3BC8O2KCL29V2", "asin": "B000H00VBQ", "reviewerName": "Carol T", "helpful": [0, 0], "reviewText": "I highly recommend this series. It is a must for anyone who is yearning to watch \"grown up\" television. Complex characters and plots to keep one totally involved. Thank you Amazin Prime.", "overall": 5.0, "summary": "Excellent Grown Up TV", "unixReviewTime": 1346630400, "reviewTime": "09 3, 2012"}
{"reviewerID": "A60D5HQFOTSOM", "asin": "B000H00VBQ", "reviewerName": "Daniel Cooper \"dancoopermedia\"", "helpful": [0, 1], "reviewText": "This one is a real snoozer. Don't believe anything you read or hear, it's awful. I had no idea what the title means. Neither will you.", "overall": 1.0, "summary": "Way too boring for me", "unixReviewTime": 1381881600, "reviewTime": "10 16, 2013"}
I am trying this:
import json
with open('Amazon_Instant_Video_5.json') as json_file:
    data = json.load(json_file)
    print(data['reviewText']['overal']['summary'])
But it gives me this error:
JSONDecodeError Traceback (most recent call last)
/var/folders/76/9lhw7d657y757vg308n_thww0000gn/T/ipykernel_4272/378691339.py in <module>
2
3 with open('Amazon_Instant_Video_5.json') as json_file:
----> 4 data = json.load(json_file)
5 print(data['reviewText']['overal']['summary'])
~/opt/anaconda3/lib/python3.9/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
291 kwarg; otherwise ``JSONDecoder`` is used.
292 """
--> 293 return loads(fp.read(),
294 cls=cls, object_hook=object_hook,
295 parse_float=parse_float, parse_int=parse_int,
~/opt/anaconda3/lib/python3.9/json/__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
344 parse_int is None and parse_float is None and
345 parse_constant is None and object_pairs_hook is None and not kw):
--> 346 return _default_decoder.decode(s)
347 if cls is None:
348 cls = JSONDecoder
~/opt/anaconda3/lib/python3.9/json/decoder.py in decode(self, s, _w)
338 end = _w(s, end).end()
339 if end != len(s):
--> 340 raise JSONDecodeError("Extra data", s, end)
341 return obj
342
JSONDecodeError: Extra data: line 2 column 1 (char 394)
CodePudding user response:
That's the JSON Lines format: each line of the file is a separate, complete JSON document. Read the file one line at a time and pass each line to json.loads():
import json
with open('Amazon_Instant_Video_5.json') as json_file:
    for line in json_file:
        data = json.loads(line)
        print(data['reviewText'], data['overall'], data['summary'])
The "Extra data" error occurs because json.load() expects the entire file to be a single JSON document. After parsing the first line it considers the document complete, so everything from line 2 onward counts as unexpected extra data.
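If you want to keep only the three fields rather than print them, the loop above can be wrapped in a small helper. This is a minimal sketch, not the only way to do it: the name extract_fields and the FIELDS tuple are made up for illustration, blank lines are skipped on the assumption that the file may contain them, and a field missing from a record is returned as None.

```python
import json

# Hypothetical constant: the fields the question asks for.
FIELDS = ("reviewText", "overall", "summary")


def extract_fields(lines, fields=FIELDS):
    """Yield a dict of the selected fields for each non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines may occur and can be ignored
        record = json.loads(line)
        # .get() returns None when a field is absent from a record
        yield {name: record.get(name) for name in fields}


# Shortened sample records in the same shape as the question's data.
sample = [
    '{"reviewText": "Boring.", "overall": 2.0, "summary": "A little bit boring for me"}',
    '',
    '{"reviewText": "Great.", "overall": 5.0, "summary": "Excellent Grown Up TV"}',
]
reviews = list(extract_fields(sample))
print(reviews[0]["overall"], reviews[1]["summary"])  # prints: 2.0 Excellent Grown Up TV
```

An open file object is also an iterable of lines, so for the real file you would pass json_file from inside the with block, for example reviews = list(extract_fields(json_file)).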
CodePudding user response:
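You can use a function from the json module, json.load(file_object), which parses the file and returns its content as a Python object that you can use to get the data. Note that json.load() expects the whole file to be a single JSON document, so it only works if the file is not in a one-object-per-line layout like the one in the question.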
Code Example
# Python program to read a JSON file
import json

# Open the JSON file
f = open('data.json')
# json.load() returns the parsed JSON document (here, a dictionary)
data = json.load(f)
# Iterate over the dictionary's keys and print each key with its value
for i in data:
    print(i, ':', data[i])
# Close the file
f.close()
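To see when json.load() applies, here is a small sketch comparing a file that holds one JSON document with a file that holds one object per line. It uses io.StringIO as a stand-in for real files, and the sample content is invented for illustration.

```python
import io
import json

# Stand-ins for files on disk (assumed content, for illustration only).
single_document = io.StringIO('[{"overall": 2.0}, {"overall": 5.0}]')
json_lines = io.StringIO('{"overall": 2.0}\n{"overall": 5.0}\n')

# One JSON document (an array of objects): json.load() parses it in one call.
data = json.load(single_document)
print(len(data))  # prints: 2

# One object per line: json.load() stops after the first object and
# reports the rest of the file as extra data.
try:
    json.load(json_lines)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = exc.msg
print(error_message)  # prints: Extra data
```

The second case is the same error as in the question, which is why a one-object-per-line file has to be parsed line by line with json.loads().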