I have a JSON file with one record per line, like this:
{"reviewerID": "A11N155CW1UV02", "asin": "B000H00VBQ", "reviewerName": "AdrianaM", "helpful": [0, 0], "reviewText": "I had big expectations because I love English TV, in particular Investigative and detective stuff but this guy is really boring. It didn't appeal to me at all.", "overall": 2.0, "summary": "A little bit boring for me", "unixReviewTime": 1399075200, "reviewTime": "05 3, 2014"}
{"reviewerID": "A3BC8O2KCL29V2", "asin": "B000H00VBQ", "reviewerName": "Carol T", "helpful": [0, 0], "reviewText": "I highly recommend this series. It is a must for anyone who is yearning to watch \"grown up\" television. Complex characters and plots to keep one totally involved. Thank you Amazin Prime.", "overall": 5.0, "summary": "Excellent Grown Up TV", "unixReviewTime": 1346630400, "reviewTime": "09 3, 2012"}
{"reviewerID": "A60D5HQFOTSOM", "asin": "B000H00VBQ", "reviewerName": "Daniel Cooper \"dancoopermedia\"", "helpful": [0, 1], "reviewText": "This one is a real snoozer. Don't believe anything you read or hear, it's awful. I had no idea what the title means. Neither will you.", "overall": 1.0, "summary": "Way too boring for me", "unixReviewTime": 1381881600, "reviewTime": "10 16, 2013"}
I need to extract the "summary" and "reviewText" fields and store them in two new files for further analysis, such as tokenization.
I am trying this:
import json
rt = open("review.txt", "a")  # append mode; creates the file if it does not exist
su = open("summary.txt", "a")
with open("/Users/anano/Desktop/MAXWELL/SPRING/NLP/Amazon_Instant_Video_5.json") as json_file:
    for line in json_file:  # each line is one JSON record
        data = json.loads(line)
        rt.write(data['reviewText'])
        su.write(data['summary'])
rt.close()
su.close()
Because the summaries do not end with a period, they are all written back to back as one string, like this:
A little bit boring for meExcellent Grown Up TVWay too boring for meRobson Green is mesmerizing
This makes tokenization impossible. How can I solve this problem?
CodePudding user response:
All you need to do is append \n to each string before writing it. (\n is the escape sequence for a newline character; write() does not add one for you, so consecutive writes run together.)
So your code becomes:
import json
rt = open("review.txt", "a")  # append mode; creates the file if it does not exist
su = open("summary.txt", "a")
with open("/Users/anano/Desktop/MAXWELL/SPRING/NLP/Amazon_Instant_Video_5.json") as json_file:
    for line in json_file:  # each line is one JSON record
        data = json.loads(line)
        rt.write(data['reviewText'] + '\n')  # one review per line
        su.write(data['summary'] + '\n')  # one summary per line
rt.close()
su.close()
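If it helps, here is a slightly more defensive variant of the same idea, offered as a sketch rather than the one right way to do it. The function name extract_fields and the sample records in the demo are made up for illustration. It makes two assumptions that go beyond the question: the output files are opened with "w" instead of "a" so that rerunning the script does not duplicate lines, and any newlines inside a review are collapsed to spaces so that each record stays on exactly one line.

```python
import json
import os
import tempfile


def extract_fields(json_path, review_path, summary_path):
    """Write each record's reviewText and summary on its own line.

    Returns the number of records written. (Hypothetical helper name.)
    """
    count = 0
    # The with-statement closes all three files, even if an error occurs.
    with open(json_path, encoding="utf-8") as src, \
            open(review_path, "w", encoding="utf-8") as rt, \
            open(summary_path, "w", encoding="utf-8") as su:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines instead of failing in json.loads
            data = json.loads(line)
            # Assumption: collapse embedded whitespace (including newlines)
            # so that one record always maps to one output line.
            rt.write(" ".join(data["reviewText"].split()) + "\n")
            su.write(" ".join(data["summary"].split()) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Small demo on throwaway files with two made-up records.
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "sample.json")
        with open(src_path, "w", encoding="utf-8") as f:
            f.write(json.dumps({"reviewText": "Really boring.",
                                "summary": "A little bit boring for me"}) + "\n")
            f.write(json.dumps({"reviewText": "I highly recommend this series.",
                                "summary": "Excellent Grown Up TV"}) + "\n")
        n = extract_fields(src_path,
                           os.path.join(tmp, "review.txt"),
                           os.path.join(tmp, "summary.txt"))
        with open(os.path.join(tmp, "summary.txt"), encoding="utf-8") as f:
            print(n, f.read().splitlines())
```

With one record per line, reading the file back with splitlines() (or iterating over the file object) gives you the individual summaries ready for tokenization.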