I have a PySpark DataFrame with a column that contains textual content.
I am trying to count the number of sentences that contain an exclamation mark '!' together with the word "like" or "want".
For example, given a row whose column contains the following sentences:
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food.
you don't want to!
what does he want?
The desired output I'm hoping to achieve would look something like this (counting only the sentences that contain "like" or "want" together with "!"):
+----+-----+
|word|count|
+----+-----+
|like|    2|
|want|    2|
+----+-----+
Can someone help me write a UDF that does this? Here is what I have so far, but I can't get it to work:
from nltk.tokenize import sent_tokenize

def convert_a_sentence(a_string):
    # split into sentences, lower-casing each one
    sentences = [s.lower() for s in sent_tokenize(a_string)]
    return sentences

df = df.withColumn('a_sentence', convert_a_sentence(df['text']))
df.select(explode('a_sentence').alias('found')).filter(df['a_sentence'].isin('like', 'want', '!')).groupBy('found').count().collect()
CodePudding user response:
If all you want is unigrams (i.e. single tokens), you can just split each sentence on spaces, then explode, group by, count, and filter for the words you want:
from pyspark.sql import functions as F

(df
    .withColumn('words', F.split('sentence', ' '))  # split each sentence on spaces
    .withColumn('word', F.explode('words'))         # one row per token
    .groupBy('word')
    .agg(
        F.count('*').alias('word_cnt')
    )
    .where(F.col('word').isin(['like', 'want']))
    .show()
)
# Output
# +----+--------+
# |word|word_cnt|
# +----+--------+
# |want|       2|
# |like|       3|
# +----+--------+
Note #1: you can apply the filter before the groupBy, using the contains function; see the sketch below.
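For illustration, a minimal sketch of that reordering (this assumes the same df with a sentence column as above, and adds back the '!' requirement from the question):

from pyspark.sql import functions as F

(df
    # keep only sentences that contain an exclamation mark
    .where(F.col('sentence').contains('!'))
    .withColumn('word', F.explode(F.split('sentence', ' ')))
    # filter before groupBy: keep only the target words
    .where(F.col('word').isin('like', 'want'))
    .groupBy('word')
    .count()
    .show()
)
# Output (matches the desired result in the question)
# +----+-----+
# |word|count|
# +----+-----+
# |want|    2|
# |like|    2|
# +----+-----+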
Note #2: If you ever want n-grams instead of the "hack" above, you can consider using the Spark ML package with Tokenizer:
from pyspark.ml.feature import Tokenizer

tokenizer = Tokenizer(inputCol='sentence', outputCol='words')
tokenized = tokenizer.transform(df)
tokenized.show(truncate=False)
# Output
# +----------------------+----------------------------+
# |sentence              |words                       |
# +----------------------+----------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |
# |I like to go shopping!|[i, like, to, go, shopping!]|
# |I want to go home!    |[i, want, to, go, home!]    |
# |I like fast food.     |[i, like, fast, food.]      |
# |you don't want to!    |[you, don't, want, to!]     |
# |what does he want?    |[what, does, he, want?]     |
# +----------------------+----------------------------+
or NGram:
from pyspark.ml.feature import NGram

ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramed = ngram.transform(tokenized)
ngramed.show(truncate=False)
# Output
# +----------------------+----------------------------+----------------------------------------+
# |sentence              |words                       |ngrams                                  |
# +----------------------+----------------------------+----------------------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |[i don't, don't like, like to, to sing!]|
# |I like to go shopping!|[i, like, to, go, shopping!]|[i like, like to, to go, go shopping!]  |
# |I want to go home!    |[i, want, to, go, home!]    |[i want, want to, to go, go home!]      |
# |I like fast food.     |[i, like, fast, food.]      |[i like, like fast, fast food.]         |
# |you don't want to!    |[you, don't, want, to!]     |[you don't, don't want, want to!]       |
# |what does he want?    |[what, does, he, want?]     |[what does, does he, he want?]          |
# +----------------------+----------------------------+----------------------------------------+
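To tie this back to the original question, here is a sketch (assuming the tokenized DataFrame from above) that counts sentences rather than word occurrences, using array_contains so each sentence is counted at most once per word:

from pyspark.sql import functions as F

for w in ['like', 'want']:
    # sentences that contain '!' and the target word as a whole token
    cnt = (tokenized
           .where(F.col('sentence').contains('!'))
           .where(F.array_contains('words', w))
           .count())
    print(w, cnt)
# like 2
# want 2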
CodePudding user response:
I'm not sure about the pandas or PySpark way of doing it, but you could do this pretty easily with a plain Python function:
from nltk.tokenize import sent_tokenize

t = """
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food.
you don't want to!
what does he want?
"""

# sent_tokenize returns a list of strings; lower-case each one
sentences = [s.lower() for s in sent_tokenize(t)]
for sentence in sentences:
    if "!" in sentence and ("like" in sentence or "want" in sentence):
        print(f"found in {sentence}")
and you should be able to figure out how to count the matches and put them in a table; a sketch of the counting step follows.
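For completeness, a minimal sketch of that counting step using collections.Counter (the result could then be loaded into a Spark or pandas DataFrame):

from collections import Counter

counts = Counter()
for sentence in sentences:
    if "!" in sentence:
        # count each sentence at most once per target word
        for word in ("like", "want"):
            if word in sentence:
                counts[word] += 1

print(counts)  # Counter({'like': 2, 'want': 2})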