I have a PySpark DataFrame with a column that contains textual content.
I am trying to count the number of sentences that contain an exclamation mark '!' together with the word "like" or "want".
For example, given a row whose column contains the following sentences:
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food.
you don't want to!
what does he want?
The desired output I'm hoping to achieve would look something like this (counting only the sentences that contain "like" or "want" together with "!"):
+----+-----+
|word|count|
+----+-----+
|like|    2|
|want|    2|
+----+-----+
Can someone help me write a UDF that does this? Here is what I have so far, but I can't get it to work:
from nltk.tokenize import sent_tokenize

def convert_a_sentence(a_string):
    # split into sentences, lower-casing each one
    sentences = [s.lower() for s in sent_tokenize(a_string)]
    return sentences

df = df.withColumn('a_sentence', convert_a_sentence(df['text']))
df.select(explode('a_sentence').alias('found')).filter(df['a_sentence'].isin('like', 'want', '!')).groupBy('found').count().collect()
CodePudding user response:
If all you want is unigrams (i.e. single tokens), you can just split each sentence on spaces, then explode, group by, count, and filter for the words you want:
from pyspark.sql import functions as F

(df
    .withColumn('words', F.split('sentence', ' '))  # split each sentence on spaces
    .withColumn('word', F.explode('words'))         # one row per token
    .groupBy('word')
    .agg(
        F.count('*').alias('word_cnt')
    )
    .where(F.col('word').isin(['like', 'want']))
    .show()
)
# Output
# +----+--------+
# |word|word_cnt|
# +----+--------+
# |want|       2|
# |like|       3|
# +----+--------+
Note #1: you can apply the filter before the groupBy, using the contains function; see the sketch below.
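For illustration, a minimal sketch of that reordering (this assumes the same df with a sentence column as above, and adds back the '!' requirement from the question):

from pyspark.sql import functions as F

(df
    # keep only sentences that contain an exclamation mark
    .where(F.col('sentence').contains('!'))
    .withColumn('word', F.explode(F.split('sentence', ' ')))
    # filter before groupBy: keep only the target words
    .where(F.col('word').isin('like', 'want'))
    .groupBy('word')
    .count()
    .show()
)
# Output (matches the desired result in the question)
# +----+-----+
# |word|count|
# +----+-----+
# |want|    2|
# |like|    2|
# +----+-----+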
Note #2: If you ever want n-grams instead of the "hack" above, you can consider using the Spark ML package with Tokenizer:
from pyspark.ml.feature import Tokenizer

tokenizer = Tokenizer(inputCol='sentence', outputCol='words')
tokenized = tokenizer.transform(df)
tokenized.show(truncate=False)
# Output
# +----------------------+----------------------------+
# |sentence              |words                       |
# +----------------------+----------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |
# |I like to go shopping!|[i, like, to, go, shopping!]|
# |I want to go home!    |[i, want, to, go, home!]    |
# |I like fast food.     |[i, like, fast, food.]      |
# |you don't want to!    |[you, don't, want, to!]     |
# |what does he want?    |[what, does, he, want?]     |
# +----------------------+----------------------------+
or NGram:
from pyspark.ml.feature import NGram

ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramed = ngram.transform(tokenized)
ngramed.show(truncate=False)
# Output
# +----------------------+----------------------------+----------------------------------------+
# |sentence              |words                       |ngrams                                  |
# +----------------------+----------------------------+----------------------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |[i don't, don't like, like to, to sing!]|
# |I like to go shopping!|[i, like, to, go, shopping!]|[i like, like to, to go, go shopping!]  |
# |I want to go home!    |[i, want, to, go, home!]    |[i want, want to, to go, go home!]      |
# |I like fast food.     |[i, like, fast, food.]      |[i like, like fast, fast food.]         |
# |you don't want to!    |[you, don't, want, to!]     |[you don't, don't want, want to!]       |
# |what does he want?    |[what, does, he, want?]     |[what does, does he, he want?]          |
# +----------------------+----------------------------+----------------------------------------+
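To tie this back to the original question, here is a sketch (assuming the tokenized DataFrame from above) that counts sentences rather than word occurrences, using array_contains so each sentence is counted at most once per word:

from pyspark.sql import functions as F

for w in ['like', 'want']:
    # sentences that contain '!' and the target word as a whole token
    cnt = (tokenized
           .where(F.col('sentence').contains('!'))
           .where(F.array_contains('words', w))
           .count())
    print(w, cnt)
# like 2
# want 2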
CodePudding user response:
I'm not sure about the pandas or PySpark way of doing it, but you could do this pretty easily with a plain Python function:
from nltk.tokenize import sent_tokenize

t = """
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food.
you don't want to!
what does he want?
"""

# sent_tokenize returns a list of strings; lower-case each one
sentences = [s.lower() for s in sent_tokenize(t)]
for sentence in sentences:
    if "!" in sentence and ("like" in sentence or "want" in sentence):
        print(f"found in {sentence}")
and you should be able to figure out how to count the matches and put them in a table; a sketch of the counting step follows.
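For completeness, a minimal sketch of that counting step using collections.Counter (the result could then be loaded into a Spark or pandas DataFrame):

from collections import Counter

counts = Counter()
for sentence in sentences:
    if "!" in sentence:
        # count each sentence at most once per target word
        for word in ("like", "want"):
            if word in sentence:
                counts[word] += 1

print(counts)  # Counter({'like': 2, 'want': 2})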