I have a dataframe containing home descriptions:
description
0 Beautiful, spacious skylit studio in the heart...
1 Enjoy 500 s.f. top floor in 1899 brownstone, w...
2 The spaceHELLO EVERYONE AND THANKS FOR VISITIN...
3 We welcome you to stay in our lovely 2 br dupl...
4 Please don’t expect the luxury here just a bas...
5 Our best guests are seeking a safe, clean, spa...
6 Beautiful house, gorgeous garden, patio, cozy ...
7 Comfortable studio apartment with super comfor...
8 A charming month-to-month home away from home ...
9 Beautiful peaceful healthy homeThe spaceHome i...
I'm trying to count the number of sentences in each row (using sent_tokenize from nltk.tokenize) and append those values as a new column, sentence_count, to the df. Since this is part of a larger data pipeline, I'm using pandas assign so that I can chain operations.
I can't seem to get it to work, though. I've tried:
df.assign(sentence_count=lambda x: len(sent_tokenize(x['description'])))
and
df.assign(sentence_count=len(sent_tokenize(df['description'])))
but both raise the following error:
TypeError: expected string or bytes-like object
I've confirmed that every value in the column is a str. Perhaps it's because the description column has dtype('O')?
What am I doing wrong here? Using pipe with a custom function works fine, but I'd prefer to use assign.
CodePudding user response:
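In the first example, x['description'] is a pandas.Series by the time you pass it to sent_tokenize, not a string. It's a Series (similar to a list) of strings, while sent_tokenize expects a single string, hence the TypeError. The second example fails for the same reason, since df['description'] is also a Series.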
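So instead, apply the function to each element and take the length of the result (without len you get the list of sentences rather than the count):
df.assign(sentence_count=df['description'].apply(lambda x: len(sent_tokenize(x))))
Or, if you need to pass extra parameters to sent_tokenize (for example, language):
df.assign(sentence_count=df['description'].apply(lambda x: len(sent_tokenize(x, language='english'))))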
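Since the question is about keeping this inside a chain, here is a minimal sketch of the callable form of assign, where the lambda receives the DataFrame as it exists at that point in the chain. To keep it runnable without downloading NLTK data, it uses a hypothetical regex-based split_sentences helper as a stand-in for sent_tokenize; the sample rows and the query step are also made up for illustration.

```python
import re

import pandas as pd


def split_sentences(text):
    # Hypothetical stand-in for nltk.tokenize.sent_tokenize (an assumption,
    # not NLTK's algorithm): split on whitespace that follows ., ! or ?.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Made-up sample data, loosely modelled on the descriptions in the question.
df = pd.DataFrame({
    "description": [
        "Beautiful studio. Close to the subway.",
        "Top floor in a brownstone.",
        "Welcome! Please read the rules. Enjoy your stay.",
    ]
})

result = (
    df
    # The outer lambda receives the DataFrame; the inner one receives each string.
    .assign(sentence_count=lambda d: d["description"].apply(
        lambda text: len(split_sentences(text))))
    # A further chained step, included only to show that the chain continues.
    .query("sentence_count > 1")
)
```

With NLTK and its tokenizer data installed, replacing split_sentences with sent_tokenize should work the same way, because apply hands one string at a time to the function. The outer lambda matters mainly when earlier steps in the chain have changed the frame, so that df['description'] would refer to stale data.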