After importing the libraries and loading the dataset, my code so far looks like this:
data = pd.read_csv("/content/gdrive/MyDrive/Data/tripadvisor_hotel_reviews.csv")
reviews = data['Review'].str.lower()
#Check
print(reviews)
print(type('Review'))
print(type(reviews))
The output, however, looks like this:
0 nice hotel expensive parking got good deal sta...
1 ok nothing special charge diamond member hilto...
2 nice rooms not 4* experience hotel monaco seat...
3 unique, great stay, wonderful time hotel monac...
4 great stay great stay, went seahawk game aweso...
...
20486 best kept secret 3rd time staying charm, not 5...
20487 great location price view hotel great quick pl...
20488 ok just looks nice modern outside, desk staff ...
20489 hotel theft ruined vacation hotel opened sept ...
20490 people talking, ca n't believe excellent ratin...
Name: Review, Length: 20491, dtype: object
<class 'str'>
<class 'pandas.core.series.Series'>
I want to know why the variable "reviews" has a different type than the data column "Review" if I (supposedly) set them equal.
This is a problem because when I try to tokenize my data, I get an error.
My tokenizing code:
word_tokenize(reviews)
The error I get:
**TypeError** Traceback (most recent call last)
<ipython-input-9-ebaf7dca0fec> in <module>()
----> 1 word_tokenize(reviews)
8 frames
/usr/local/lib/python3.7/dist-packages/nltk/tokenize/punkt.py in _slices_from_text(self, text)
1287 def _slices_from_text(self, text):
1288 last_break = 0
-> 1289 for match in self._lang_vars.period_context_re().finditer(text):
1290 context = match.group() + match.group('after_tok')
1291 if self.text_contains_sentbreak(context):
**TypeError:** expected string or bytes-like object
CodePudding user response:
There are several things going on here. First of all, reviews is a pd.Series. This means that word_tokenize(reviews) won't work: word_tokenize expects a single string, not a Series of strings. You can, however, tokenize each string in turn. The following should work:
tokens = [word_tokenize(review) for review in reviews]
Here review is a single string, so you are tokenizing each string in the pd.Series named reviews, one at a time.
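To make the cause of the error concrete, here is a minimal sketch that runs without pandas or NLTK. It rests on two assumptions that are not part of the original code: a plain list stands in for the Series, and simple_tokenize is a hypothetical regex-based stand-in for word_tokenize. Like NLTK's tokenizer, it hands its argument to a compiled regular expression, which is where the TypeError comes from.

```python
import re

# Hypothetical stand-in for word_tokenize: a regex-based tokenizer that,
# like NLTK's, expects exactly one string per call.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split one string into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

# Assumption: a plain list stands in for the pandas Series of reviews.
reviews = ["nice hotel, expensive parking", "ok nothing special"]

# Passing the whole collection fails, because the regex needs a string.
try:
    simple_tokenize(reviews)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Tokenizing one review at a time works.
tokens = [simple_tokenize(review) for review in reviews]
print(tokens)
```

With the real objects, reviews.apply(word_tokenize) should do the same thing as the list comprehension while keeping the result as a Series, assuming the NLTK tokenizer data is already downloaded.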
Also, comparing type('Review') and type(reviews) makes no sense. reviews is a pd.Series (i.e. an iterable) holding many different strings, while 'Review' is a string literal that holds the English word "Review"; it is only the column's label, not the column itself. type('Review') will always be str. In contrast, type(reviews) depends on whatever value the variable reviews holds. To check the type of the column, use type(data['Review']), which is also a pd.Series.
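The difference between a label and the object it looks up can be sketched without pandas. This is only an illustration: table is a hypothetical dict standing in for the DataFrame, so the looked-up value is a list here, whereas with pandas it would be a Series.

```python
# Assumption: a dict stands in for the DataFrame; its key plays the role
# of the column label and its value plays the role of the column.
table = {"Review": ["nice hotel", "ok nothing special"]}

literal_type = type("Review")        # type of the label itself
column_type = type(table["Review"])  # type of what the label looks up

print(literal_type)
print(column_type)
```

The first type never changes, because the argument is always the same string literal. The second depends entirely on what is stored under that label.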