Chinese word frequency statistics with Spark Streaming and jieba: jieba module cannot be found


I am using jieba with Spark Streaming to compute Chinese word frequency statistics. The code below fails with an error saying the jieba module cannot be found, even though the import clearly succeeds, and jieba works fine when I use it without Spark Streaming. Any help would be much appreciated; the homework is due tonight (in tears).

from pyspark.context import SparkContext
import jieba
# from pyspark.sql.session import SparkSession
# from pyspark.ml import Pipeline
# from pyspark.ml.feature import StringIndexer, VectorIndexer
sc = SparkContext("local", "WordCount")  # initialize the configuration
data = sc.textFile(r"D:\WordCount.txt")  # the input file is UTF-8
with open(r'd:\stop words in Chinese libraries.txt', 'r', encoding='utf-8') as f:
    x = f.readlines()
stop = [i.replace('\n', '') for i in x]
stop.extend([',', 'the', 'I', 'he', 'and', '\n', '?', ';', ':', '-', '(', ')', '.', '1909', '1920', '325', 'B612', '2', '3', 'IV', 'V', 'VI', '"', '...'])  # also treat punctuation and similar tokens as stop words
data = data.flatMap(lambda line: jieba.cut(line, cut_all=False)).filter(lambda w: w not in stop) \
    .map(lambda w: (w, 1)).reduceByKey(lambda w0, w1: w0 + w1).sortBy(lambda x: x[1], ascending=False)
print(data.take(100))
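
One thing that may be worth checking, assuming the error is raised inside the Spark worker processes rather than in the driver script: the workers might be running a different Python interpreter from the one where jieba is installed. The sketch below only reports what the driver sees. `module_available` and `describe_python_env` are hypothetical helper names made up for this illustration; `PYSPARK_PYTHON` is the environment variable PySpark reads to choose the worker interpreter.

```python
import importlib.util
import os
import sys


def module_available(name):
    """Return True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(name) is not None


def describe_python_env(module_name):
    """Collect the facts needed to compare driver and worker interpreters."""
    return {
        # Interpreter running this script (the driver side).
        "driver_python": sys.executable,
        # Interpreter PySpark is asked to use for workers; None if unset.
        "worker_python": os.environ.get("PYSPARK_PYTHON"),
        # Whether the module is importable on the driver side.
        "module_found_on_driver": module_available(module_name),
    }


if __name__ == "__main__":
    for key, value in describe_python_env("jieba").items():
        print(key, "=", value)
```

If `worker_python` is unset or points somewhere other than `driver_python`, setting `os.environ["PYSPARK_PYTHON"] = sys.executable` before creating the `SparkContext` is one possible thing to try; whether that is the actual cause here is an assumption, not something the error description confirms.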