I am trying to load book reviews from this page.
CodePudding user response:
Loading lines one at a time into DataFrames just to check their rating is incredibly inefficient. It's better to treat each line as a dictionary while filtering and only build the Series at the end.
import json
import gzip
import pandas as pd

def parse(path):
    # Stream the gzipped JSON Lines file, yielding one dict per line.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for l in g:
            yield json.loads(l)

file = parse('Books_5.json.gz')
pos_revs = []
neg_revs = []
# A for loop ends cleanly if the file runs out before both lists are full;
# calling next(file) in a while loop would raise StopIteration instead.
for line in file:
    if len(pos_revs) >= 250000 and len(neg_revs) >= 250000:
        break
    rating = line['overall']
    review = line.get('reviewText')
    if not review:
        continue  # skip records with missing or empty text
    if len(pos_revs) < 250000 and rating > 3:
        pos_revs.append(review)
    elif len(neg_revs) < 250000 and rating < 3:
        neg_revs.append(review)

pos_revs = pd.Series(pos_revs)
neg_revs = pd.Series(neg_revs)
print(pos_revs)
print(neg_revs)
Output:
0 The King, the Mice and the Cheese by Nancy Gur...
1 The kids loved it!
2 My students (3 & 4 year olds) loved this book!...
3 LOVE IT
4 Great!
...
249995 Great read. Dis t want to put it down.
249996 Love this series
249997 So I am one of those people who absolutely lov...
249998 I learned a great deal from this book. The Fre...
249999 Having already read Tuchman's book on the outb...
Length: 250000, dtype: object
0 Looking for a Louis Untermeyer book from the ...
1 Completly boring!!! Yes it's a childerns book ...
2 I don't like Hillerman novels. It was chosen ...
3 I have read many of the Hillerman books and en...
4 I really love Hillerman's books. He is one of...
...
249995 When I first started reading SUSPECT, I though...
249996 I really despised this book. Sure it portrays...
249997 This is a bleak novel. The mindless violence t...
249998 Like the title says, this is not as good as Ch...
249999 Great concept. Predictable, poorly written sto...
Length: 250000, dtype: object
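If you want to try the streaming idea without touching the 6GB file, here is a minimal sketch that wraps the same logic in a function and runs it on a tiny gzipped sample written to a temp directory. The function name `collect_reviews`, the `limit` parameter, and the sample records are all made up for illustration; only the `overall` and `reviewText` keys come from the code above.

```python
import gzip
import json
import os
import tempfile


def collect_reviews(path, limit):
    """Return (positive, negative) review texts, at most `limit` of each."""
    pos, neg = [], []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for raw in fh:
            record = json.loads(raw)
            text = record.get("reviewText")
            if not text:
                continue  # skip records with missing or empty text
            rating = record["overall"]
            if rating > 3 and len(pos) < limit:
                pos.append(text)
            elif rating < 3 and len(neg) < limit:
                neg.append(text)
            if len(pos) >= limit and len(neg) >= limit:
                break  # both lists are full, stop reading
    return pos, neg


# Hypothetical sample records, not taken from the real dataset.
sample = [
    {"overall": 5.0, "reviewText": "Loved it"},
    {"overall": 1.0, "reviewText": "Boring"},
    {"overall": 3.0, "reviewText": "It was fine"},  # neutral, ignored
    {"overall": 4.0},                               # no text, ignored
    {"overall": 4.0, "reviewText": "Good read"},
    {"overall": 2.0, "reviewText": "Not for me"},
    {"overall": 5.0, "reviewText": "A third positive"},  # never reached
]

sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as fh:
    for record in sample:
        fh.write(json.dumps(record) + "\n")

pos_sample, neg_sample = collect_reviews(sample_path, limit=2)
print(pos_sample)  # ['Loved it', 'Good read']
print(neg_sample)  # ['Boring', 'Not for me']
```

Reading stops at the sixth record because both lists are full by then, which is the same early exit that keeps the full-size run short.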
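Alternatively, a purely pandas version could look something like this, and it may be faster. Note that it collects the summary column rather than reviewText, and because matching rows are appended a whole chunk at a time it can end up with more than 250000 rows: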
reader = pd.read_json('Books_5.json.gz', lines=True, chunksize=100000)
pos_revs = pd.DataFrame()
neg_revs = pd.DataFrame()
for chunk in reader:
    if pos := (len(pos_revs) < 250000):
        temp_pos = chunk[chunk['overall'].gt(3)][['summary']]
        pos_revs = pd.concat([pos_revs, temp_pos], ignore_index=True)
        # OPTIONAL:
        pos_revs.drop_duplicates(inplace=True, ignore_index=True)
    if neg := (len(neg_revs) < 250000):
        temp_neg = chunk[chunk['overall'].lt(3)][['summary']]
        neg_revs = pd.concat([neg_revs, temp_neg], ignore_index=True)
        # OPTIONAL:
        neg_revs.drop_duplicates(inplace=True, ignore_index=True)
    if not neg and not pos:
        break
print(pos_revs)
print(neg_revs)
Output:
summary
0 A story children will love and learn from
1 Five Stars
2 Not Nice Mice
3 One of my favorite kids' stories
4 One of our families favorite books!!!
... ...
294397 PRATCHETT ON TOP FORM WITH THIS BRILLIANT NEW ...
294398 Pratchett's aphorisms get better and better
294399 Thief of Time - John Deakins for ABSOLUTE MAGN...
294400 An absolute masterpiece!
294401 Beautiful, Engaging, A Classic
[294402 rows x 1 columns]
summary
0 Two Stars
1 Don't waste your money
2 Tony missed the mark
3 Don't Start with This One!
4 Nothing special
... ...
253377 Not the caliber of "Naked"
253378 "Weak Stories" b/w "One Ace Essay"
253379 Fast service - product smells of mildew/mold.
253380 Nothing new, classical narration, good against...
253381 This is a magnificent book-hardcover version b...
[253382 rows x 1 columns]
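The row counts above overshoot 250000 because each chunk's matches are appended in full. If you need exactly the limit, one possible variation is to gather the filtered chunks in a list, concatenate once at the end, and trim with `head`. This is only a sketch: `collect_summaries`, `limit`, and the sample records are hypothetical names and data, it assumes the file has at least one line, and it needs a pandas version where the chunked reader works as a context manager (1.2 or later, as far as I know).

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def collect_summaries(path, limit, chunksize):
    """Return (positive, negative) DataFrames with at most `limit` rows each."""
    pos_parts, neg_parts = [], []
    n_pos = n_neg = 0
    with pd.read_json(path, lines=True, chunksize=chunksize) as reader:
        for chunk in reader:
            if n_pos < limit:
                part = chunk.loc[chunk["overall"].gt(3), ["summary"]]
                pos_parts.append(part)
                n_pos += len(part)
            if n_neg < limit:
                part = chunk.loc[chunk["overall"].lt(3), ["summary"]]
                neg_parts.append(part)
                n_neg += len(part)
            if n_pos >= limit and n_neg >= limit:
                break
    # Concatenate once, then cut off whatever the last chunk overshot.
    pos = pd.concat(pos_parts, ignore_index=True).head(limit)
    neg = pd.concat(neg_parts, ignore_index=True).head(limit)
    return pos, neg


# Hypothetical sample records, not taken from the real dataset.
sample = [
    {"overall": 5.0, "summary": "Five Stars"},
    {"overall": 4.0, "summary": "Good"},
    {"overall": 5.0, "summary": "Great"},
    {"overall": 1.0, "summary": "Don't bother"},
    {"overall": 2.0, "summary": "Weak"},
    {"overall": 3.0, "summary": "Okay"},
    {"overall": 1.0, "summary": "Poor"},
    {"overall": 5.0, "summary": "Superb"},
]

sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as fh:
    for record in sample:
        fh.write(json.dumps(record) + "\n")

pos_df, neg_df = collect_summaries(sample_path, limit=2, chunksize=4)
print(pos_df)
print(neg_df)
```

Concatenating once also avoids re-copying the growing DataFrame on every chunk, though I have not measured whether that matters at this size.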
I'm not sure which method is faster, but both take less than a minute to run on the 6GB file.
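If you want to settle the speed question on your own machine, a rough way is to wrap each approach in a function and time it with `time.perf_counter`. The helper `time_call` and the stand-in `dummy_loader` below are hypothetical; swap in your own loader functions.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs on its own; replace with a real loader.
def dummy_loader(n):
    return [i for i in range(n) if i % 2 == 0]


result, seconds = time_call(dummy_loader, 1000)
print(f"dummy_loader took {seconds:.6f} s and returned {len(result)} items")
```

A single run is noisy, and disk caching favors whichever method runs second, so repeat each measurement a few times before drawing conclusions.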