(TypeError: expected string or bytes-like object) Why, if my variable has my data (string) stored t

After importing the libraries and loading the dataset, my code so far looks like this:

data = pd.read_csv("/content/gdrive/MyDrive/Data/tripadvisor_hotel_reviews.csv")
reviews = data['Review'].str.lower()

#Check
print(reviews)
print(type('Review'))
print(type(reviews))

The output, however, looks like this:

0        nice hotel expensive parking got good deal sta...
1        ok nothing special charge diamond member hilto...
2        nice rooms not 4* experience hotel monaco seat...
3        unique, great stay, wonderful time hotel monac...
4        great stay great stay, went seahawk game aweso...
                               ...                        
20486    best kept secret 3rd time staying charm, not 5...
20487    great location price view hotel great quick pl...
20488    ok just looks nice modern outside, desk staff ...
20489    hotel theft ruined vacation hotel opened sept ...
20490    people talking, ca n't believe excellent ratin...
Name: Review, Length: 20491, dtype: object
<class 'str'>
<class 'pandas.core.series.Series'>

I want to know why the variable "reviews" has a different type than the data column "Review" if I (supposedly) set them equal.

This is a problem because when I try to tokenize my data, I get an error.

My tokenizing code:

word_tokenize(reviews)

The error I get:

TypeError                                 Traceback (most recent call last)
<ipython-input-9-ebaf7dca0fec> in <module>()
----> 1 word_tokenize(reviews)

8 frames
/usr/local/lib/python3.7/dist-packages/nltk/tokenize/punkt.py in _slices_from_text(self, text)
   1287     def _slices_from_text(self, text):
   1288         last_break = 0
-> 1289         for match in self._lang_vars.period_context_re().finditer(text):
   1290             context = match.group() + match.group('after_tok')
   1291             if self.text_contains_sentbreak(context):

TypeError: expected string or bytes-like object

CodePudding user response：

There are a couple of things going on here. First, reviews is a pd.Series, which means that

word_tokenize(reviews)

won't work: word_tokenize expects a single string, not a Series of strings. You can, however, tokenize each string in turn. The following should work:

tokens = [word_tokenize(review) for review in reviews]

because each review in the loop is a single string, so every string in the pd.Series named reviews gets tokenized on its own.
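To see where the error message comes from, here is a minimal sketch that needs neither pandas nor NLTK. It assumes a hypothetical stand-in tokenizer, simple_tokenize, which (like the punkt code in the traceback) calls finditer on a compiled regex, and a plain list standing in for the Series. It is not NLTK's actual algorithm, only an illustration of the same failure and the same fix.

```python
import re

# Stand-in tokenizer (an assumption for illustration, not NLTK's algorithm):
# like the punkt code in the traceback, it runs a compiled regex over one string.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Return the word and punctuation tokens found in a single string."""
    return [match.group() for match in TOKEN_RE.finditer(text)]


# A plain list standing in for the pandas Series of reviews.
reviews = [
    "nice hotel expensive parking",
    "ok nothing special",
]

# Passing the whole collection fails, because the regex engine
# only accepts str or bytes-like objects.
error_message = None
try:
    simple_tokenize(reviews)
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# Tokenizing one string at a time works.
tokens = [simple_tokenize(review) for review in reviews]
print(tokens)
```

With a real Series and NLTK installed, the same per-string pattern applies: swap simple_tokenize for word_tokenize.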

Also, comparing type('Review') with type(reviews) tells you nothing about the column. 'Review' is a string literal that holds the English word "Review", so type('Review') is always str. The column itself is data['Review'], which is a pd.Series, and calling .str.lower() on it returns another pd.Series, so reviews has the same type as the column (an iterable holding many strings). In contrast, type(reviews) depends on whatever value the variable reviews holds.
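The difference between a column label and the column it names can be sketched with a plain dict of lists. This is only an analogy: the dict stands in for the DataFrame and the list for the pd.Series, and the sample strings are made up.

```python
# A dict of lists standing in for the DataFrame (an analogy, not pandas itself).
data = {"Review": ["Nice hotel", "OK nothing special"]}

label = "Review"       # just the name used to look the column up
column = data[label]   # the data stored under that name

print(type(label))      # <class 'str'>
print(type(column))     # <class 'list'>  (a pd.Series in the pandas case)
print(type(column[0]))  # <class 'str'>   (each individual review)
```

So to check the type of the actual column in the question, print type(data['Review']) rather than type('Review').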