Word frequency statistics algorithm

Time:05-26

Take the following phrases as an example:

Phrase 1: seller in lien to looking and ㈥
Phrase 2: seller in lien to
Phrase 3: looking and (
Phrase 4: seller in lien seller
Phrase 5: seller to denominated in

Count how many times each run of consecutive words (at least two words long) appears across the phrases above.
The result should be:
"seller in lien to" appears 2 times
"looking and" appears 2 times
"seller in lien" appears 3 times
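
The counting described above can be sketched in Python as follows. This is only a minimal sketch under a few assumptions: the function names (`count_ngrams`, `contains`, `repeated_maximal`) are made up for illustration, the unreadable trailing symbols in phrases 1 and 3 are left out, and the rule "drop a phrase when a longer repeated phrase contains it with the same count" is inferred from the expected result rather than stated in the question.

```python
from collections import Counter

def count_ngrams(phrases, min_n=2):
    """Count every run of min_n or more consecutive words in each phrase."""
    counts = Counter()
    for phrase in phrases:
        words = phrase.lower().split()
        for n in range(min_n, len(words) + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))

def repeated_maximal(phrases, min_count=2):
    """Keep repeated n-grams not covered by a longer one with the same count."""
    counts = count_ngrams(phrases)
    repeated = {g: c for g, c in counts.items() if c >= min_count}
    result = {}
    for gram, c in repeated.items():
        covered = any(
            len(other) > len(gram) and oc == c and contains(other, gram)
            for other, oc in repeated.items()
        )
        if not covered:
            result[" ".join(gram)] = c
    return result

# The five phrases from the question (trailing symbols omitted, an assumption).
phrases = [
    "seller in lien to looking and",
    "seller in lien to",
    "looking and",
    "seller in lien seller",
    "seller to denominated in",
]
print(repeated_maximal(phrases))
```

Without the final filter, shorter fragments such as "seller in" (3 times) and "lien to" (2 times) would also be reported, since every repeated phrase also repeats all of its sub-phrases.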

I tried building an N-ary tree and traversing it breadth-first, using node depth to work out the word frequency, but the links between nodes ended up forming cycles and I still cannot find a solution.

CodePudding user response:

Please post some sample data to run the statistics on; the right method differs a lot from one kind of data to another.

CodePudding user response:

I have reduced the problem to a simple demo.
The requirement: given an English article, find the runs of consecutive words that occur at least 2 times.
For example, if the article is: I am a man, I am a programmer. I am on programming now.
Results:
I am a (2 times)
I am (3 times)
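
For running text like this, the article first has to be split into words. The sketch below is one way to do that in Python, assuming that case is ignored, that punctuation is not part of a word, and that a phrase should not run across a sentence end; none of these rules are stated in the thread, and `article_ngram_counts` is a hypothetical name.

```python
import re
from collections import Counter

def article_ngram_counts(text, min_n=2, min_count=2):
    """Return word n-grams (n >= min_n) that occur at least min_count times."""
    counts = Counter()
    # Split on ., ! and ? so that phrases do not span sentence boundaries.
    for sentence in re.split(r"[.!?]+", text):
        # Lowercase and keep only letter runs (apostrophes allowed).
        words = re.findall(r"[a-z']+", sentence.lower())
        for n in range(min_n, len(words) + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

essay = "I am a man, I am a programmer. I am on programming now."
print(article_ngram_counts(essay))
```

On the sample essay the raw counts are "i am" 3 times, "i am a" 2 times and "am a" 2 times. The last one is missing from the expected results, presumably because it only ever occurs inside "I am a".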

CodePudding user response:

https://stackoverflow.com/questions/42752356/create-a-dictionary-with-word-groups

CodePudding user response:

Quoting nayi_224's reply (3rd floor):
https://stackoverflow.com/questions/42752356/create-a-dictionary-with-word-groups

The earlier answers in that post rely on a dictionary of keywords or phrases that is preset in advance.
The last answer uses some Python libraries; I don't know whether it can solve my problem. Also, if you are familiar with Python, could you help me port it to Java?
Thank you.

import nltk
from nltk import word_tokenize  # needs the NLTK tokenizer data (e.g. 'punkt') to be downloaded
from nltk.collocations import *

meaningful_text = 'As a Data Scientist, you will focus on the machine learning and Natural Language Processing'

# Count every pair of adjacent words and score each pair by its relative frequency.
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(meaningful_text))
scored = finder.score_ngrams(bigram_measures.raw_freq)
sorted(scored, key=lambda s: s[1], reverse=True)

CodePudding user response:

At the moment I split each line into words, regroup them into phrases of 2 to n words, and then check whether the later lines contain each phrase. This does the job: for 10,000 short sentences the statistics come out in about 1 second. Still, the algorithm feels a bit clumsy. Is anyone familiar with a better algorithm?
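
One possible alternative to comparing every phrase against the later lines is to count phrases in a hash map and grow them one word at a time: a phrase of n+1 words can only repeat if its first n words repeat, so only the positions of repeated phrases need to be extended. The Python sketch below illustrates that idea; it is not a measured improvement over the approach above, `repeated_ngrams_levelwise` is a hypothetical name, and the sample lines are the five phrases from the question with the unreadable trailing symbols left out.

```python
from collections import Counter

def repeated_ngrams_levelwise(lines, min_count=2):
    """Count repeated word n-grams (n >= 2), extending them level by level."""
    token_lines = [line.split() for line in lines]
    result = {}
    # Start positions (line index, word offset) of every 2-word phrase.
    positions = [
        (i, j)
        for i, toks in enumerate(token_lines)
        for j in range(len(toks) - 1)
    ]
    n = 2
    while positions:
        counts = Counter(tuple(token_lines[i][j:j + n]) for i, j in positions)
        frequent = {g: c for g, c in counts.items() if c >= min_count}
        result.update(frequent)
        # Keep only starts whose n-gram repeats and that have room for one more word.
        positions = [
            (i, j)
            for i, j in positions
            if tuple(token_lines[i][j:j + n]) in frequent
            and j + n < len(token_lines[i])
        ]
        n += 1
    return result

lines = [
    "seller in lien to looking and",
    "seller in lien to",
    "looking and",
    "seller in lien seller",
    "seller to denominated in",
]
for gram, count in repeated_ngrams_levelwise(lines).items():
    print(" ".join(gram), count)
```

Each level is a single pass over the surviving positions, so no line is ever compared directly with another line; rare long phrases are dropped as soon as their prefix stops repeating.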