How to solve Python Pandas assign error when creating new column

I have a dataframe containing home descriptions:

description
0   Beautiful, spacious skylit studio in the heart...
1   Enjoy 500 s.f. top floor in 1899 brownstone, w...
2   The spaceHELLO EVERYONE AND THANKS FOR VISITIN...
3   We welcome you to stay in our lovely 2 br dupl...
4   Please don’t expect the luxury here just a bas...
5   Our best guests are seeking a safe, clean, spa...
6   Beautiful house, gorgeous garden, patio, cozy ...
7   Comfortable studio apartment with super comfor...
8   A charming month-to-month home away from home ...
9   Beautiful peaceful healthy homeThe spaceHome i...

I'm trying to count the number of sentences in each row (using sent_tokenize from nltk.tokenize) and add those values to the df as a new column, sentence_count. Since this is part of a larger data pipeline, I'm using pandas assign so that I can chain operations.

I can't seem to get it to work, though. I've tried:

df.assign(sentence_count=lambda x: len(sent_tokenize(x['description'])))

and

df.assign(sentence_count=len(sent_tokenize(df['description'])))

but both raise the following error:

TypeError: expected string or bytes-like object

I've confirmed that the value in each row is a str. Perhaps it's because description has dtype('O')?

What am I doing wrong here? Using a pipe with a custom function works fine here, but I prefer using assign.

CodePudding user response:

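When you pass x['description'] to sent_tokenize in the first example, it is a pandas.Series, not a string. A Series is a collection (similar to a list) of strings, while sent_tokenize expects a single string, hence the TypeError. The same applies to df['description'] in the second example.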
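A minimal sketch of why the error appears, assuming the tokenizer relies on regular expressions internally (which the wording of the TypeError suggests). The two-row Series and the punctuation pattern are made up for illustration and are not part of nltk:

```python
import re

import pandas as pd

# Hypothetical two-row stand-in for the description column.
descriptions = pd.Series(["Nice flat. Great view.", "Cozy room."])

# Illustrative pattern only; not what nltk actually uses.
pattern = re.compile(r"[.!?]")

# A single string is what a regex-based function expects.
hits_for_one_string = pattern.findall(descriptions[0])

# Passing the whole Series fails, because a Series is not a string.
try:
    pattern.findall(descriptions)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(hits_for_one_string)
print(error_message)
```

The dtype('O') of the column is not the problem: object dtype is how pandas normally stores Python strings. The problem is passing the whole column where one string is expected.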

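So instead, apply the function to each element of the column, and wrap it in len so that you get a count rather than the list of sentences:

df.assign(sentence_count=df['description'].apply(lambda x: len(sent_tokenize(x))))

Or, if you need to pass extra parameters to sent_tokenize (for example, language):

df.assign(sentence_count=df['description'].apply(lambda x: len(sent_tokenize(x, language='english'))))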
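Since the question is about chaining, here is a sketch of the same idea with a callable passed to assign, so the step does not refer to df by name. To keep it runnable without downloading nltk data, split_sentences is a hypothetical regex-based stand-in for sent_tokenize, and the three descriptions are invented sample data:

```python
import re

import pandas as pd


def split_sentences(text):
    """Hypothetical stand-in for nltk's sent_tokenize: split after ., ! or ?."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


df = pd.DataFrame(
    {
        "description": [
            "Beautiful studio. Close to the park!",
            "Enjoy the top floor.",
            "Safe and clean. Quiet street. Great hosts.",
        ]
    }
)

result = (
    df
    # A callable passed to assign receives the DataFrame at this point in the chain.
    .assign(
        sentence_count=lambda d: d["description"].apply(
            lambda s: len(split_sentences(s))
        )
    )
    # Later steps can use the new column immediately.
    .assign(is_long=lambda d: d["sentence_count"] > 1)
)

print(result)
```

In a real pipeline, swapping split_sentences for sent_tokenize should be the only change needed. The outer lambda receives the whole DataFrame, and the inner one receives a single description string.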