I want to count how often each two-word combination (bigram) occurs across all the rows of a column.
I have a table with two columns: the first holds a sentence, and the second holds one bigram from the tokenization of that sentence.
Sentence | words |
---|---|
'beautiful day suffered through ' | 'beautiful day' |
'beautiful day suffered through ' | 'day suffered' |
'beautiful day suffered through ' | 'suffered through' |
'cannot hold back tears ' | 'cannot hold' |
'cannot hold back tears ' | 'hold back' |
'cannot hold back tears ' | 'back tears' |
'ash back tears beautiful day ' | 'ash back' |
'ash back tears beautiful day ' | 'back tears' |
'ash back tears beautiful day ' | 'tears beautiful' |
'ash back tears beautiful day ' | 'beautiful day' |
My desired output is an extra column giving how often each bigram occurs across all the sentences in the whole df['Sentence'] column. Something like this:
Sentence | words | Total |
---|---|---|
'beautiful day suffered through ' | 'beautiful day' | 2 |
'beautiful day suffered through ' | 'day suffered' | 1 |
'beautiful day suffered through ' | 'suffered through' | 1 |
'cannot hold back tears ' | 'cannot hold' | 1 |
'cannot hold back tears ' | 'hold back' | 1 |
'cannot hold back tears ' | 'back tears' | 2 |
'ash back tears beautiful day ' | 'ash back' | 1 |
'ash back tears beautiful day ' | 'back tears' | 2 |
'ash back tears beautiful day ' | 'tears beautiful' | 1 |
'ash back tears beautiful day ' | 'beautiful day' | 2 |
and so on.
The code I have tried returns the same count for every row belonging to a given sentence:
df.Sentence.str.count('|'.join(df.words.tolist()))
That is not what I am looking for, and it is also very slow because my original df is much larger.
Is there an alternative, or a function in NLTK or another library, that does this?
CodePudding user response:
As I understand it, you want the count of each bigram across all the sentences. That information already exists in the `words` column, and `value_counts()` extracts it:
df.merge(df['words'].value_counts().rename('words_total'), how='left', left_on='words', right_index=True)
Sentence words words_total
0 beautiful day suffered through beautiful day 2
1 beautiful day suffered through day suffered 1
2 beautiful day suffered through suffered through 1
3 cannot hold back tears cannot hold 1
4 cannot hold back tears hold back 1
5 cannot hold back tears back tears 2
6 ash back tears beautiful day ash back 1
7 ash back tears beautiful day back tears 2
8 ash back tears beautiful day tears beautiful 1
9 ash back tears beautiful day beautiful day 2
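For comparison, here is a minimal sketch of the same totals computed without a merge, by mapping each bigram to its `value_counts()` entry. It assumes the frame is called `df` and has the columns `Sentence` and `words`. The sample rows are retyped by hand from the question's table (quotes and trailing spaces dropped), and the column name `Total` is borrowed from the desired output.

```python
import pandas as pd

# Sample data retyped from the question's table (an assumption about the real df).
rows = [
    ("beautiful day suffered through", "beautiful day"),
    ("beautiful day suffered through", "day suffered"),
    ("beautiful day suffered through", "suffered through"),
    ("cannot hold back tears", "cannot hold"),
    ("cannot hold back tears", "hold back"),
    ("cannot hold back tears", "back tears"),
    ("ash back tears beautiful day", "ash back"),
    ("ash back tears beautiful day", "back tears"),
    ("ash back tears beautiful day", "tears beautiful"),
    ("ash back tears beautiful day", "beautiful day"),
]
df = pd.DataFrame(rows, columns=["Sentence", "words"])

# value_counts() gives one count per distinct bigram;
# map() looks each row's bigram up in that Series.
df["Total"] = df["words"].map(df["words"].value_counts())
print(df)
```

This should give the same numbers as the merge above, and it avoids building an intermediate joined frame.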
CodePudding user response:
I suggest:
- Start by removing the quotes and the leading and trailing whitespace from both `Sentence` and `words`:
data = data.apply(lambda x: x.str.replace("'", ""))
data["Sentence"] = data["Sentence"].str.strip()
data["words"] = data["words"].str.strip()
- Then cast `Sentence` and `words` to strings:
data = data.astype({"Sentence":str, "words": str})
print(data)
#Output
Sentence words
0 beautiful day suffered through beautiful day
1 beautiful day suffered through day suffered
2 beautiful day suffered through suffered through
3 cannot hold back tears cannot hold
4 cannot hold back tears hold back
5 cannot hold back tears back tears
6 ash back tears beautiful day ash back
7 ash back tears beautiful day back tears
8 ash back tears beautiful day tears beautiful
9 ash back tears beautiful day beautiful day
- Count the occurrences of the given words in the sentence on the same row and store the result in a column, e.g. `words_occur`:
def words_in_sent(row):
    return row["Sentence"].count(row["words"])
data["words_occur"] = data.apply(words_in_sent, axis=1)
- Finally, group by `words` and sum up their occurrences:
data["total"] = data["words_occur"].groupby(data["words"]).transform("sum")
print(data)
Result
Sentence words words_occur total
0 beautiful day suffered through beautiful day 1 2
1 beautiful day suffered through day suffered 1 1
2 beautiful day suffered through suffered through 1 1
3 cannot hold back tears cannot hold 1 1
4 cannot hold back tears hold back 1 1
5 cannot hold back tears back tears 1 2
6 ash back tears beautiful day ash back 1 1
7 ash back tears beautiful day back tears 1 2
8 ash back tears beautiful day tears beautiful 1 1
9 ash back tears beautiful day beautiful day 1 2
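If the `words` column were not already available, the same totals could probably be derived from the sentences alone. The sketch below is one way to do that with the standard library's `collections.Counter`. The helper `bigrams` and the hand-typed `sentences` list are illustrative assumptions, not part of the original data, and tokens are split on whitespace only.

```python
from collections import Counter

import pandas as pd

# The three distinct sentences from the question, retyped by hand.
sentences = [
    "beautiful day suffered through ",
    "cannot hold back tears ",
    "ash back tears beautiful day ",
]


def bigrams(sentence):
    """Return the adjacent word pairs of a sentence as space-joined strings."""
    tokens = sentence.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# One row per (sentence, bigram) pair, mirroring the question's table.
data = pd.DataFrame(
    [(s.strip(), bg) for s in sentences for bg in bigrams(s)],
    columns=["Sentence", "words"],
)

# Count each bigram over all rows, then look the count up for every row.
counts = Counter(data["words"])
data["total"] = data["words"].map(counts)
print(data)
```

Because the count is computed once over all bigrams rather than by scanning every sentence for every row, this may scale better on a large frame, though that has not been measured here.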