I'm using jieba with Spark Streaming to compute Chinese word frequencies. Running the code below fails with an error saying the jieba module cannot be found, even though the import clearly succeeds, and jieba works fine when I use it without Spark Streaming. Any help would be much appreciated, the homework is due tonight (in tears).
from pyspark.context import SparkContext
import jieba
# from pyspark.sql.session import SparkSession
# from pyspark.ml import Pipeline
# from pyspark.ml.feature import StringIndexer, VectorIndexer
sc = SparkContext("local", "WordCount")  # initialize the configuration
data = sc.textFile(r"D:\WordCount.txt")  # the file being read is UTF-8
with open(r'd:\stop words in Chinese libraries.txt', 'r', encoding="utf-8") as f:
    x = f.readlines()
stop = [i.replace('\n', '') for i in x]
stop.extend([',', 'the', 'I', 'he', 'and', ',', '\n', '?', ';', ':', '-', '(', ')', '.', '1909', '1920', '325', 'B612', '2', '3', 'IV', 'V', 'VI', '-', "'", "'", '"', '"', '...', ','])  # also treat punctuation and similar tokens as stop words
data = data.flatMap(lambda line: jieba.cut(line, cut_all=False)).filter(lambda w: w not in stop) \
    .map(lambda w: (w, 1)).reduceByKey(lambda w0, w1: w0 + w1).sortBy(lambda x: x[1], ascending=False)
print(data.take(100))
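To check the counting logic separately from Spark, the same flatMap / filter / count / sort steps can be sketched in plain Python. This is only a minimal sketch: `word_counts` and `sample` are hypothetical names, and `str.split` stands in for the tokenizer so the example runs without jieba installed.

```python
from collections import Counter


def word_counts(lines, tokenize, stop_words):
    """Count tokens across all lines, skipping stop words, most frequent first."""
    stop = set(stop_words)  # a set makes the membership test cheap
    counts = Counter(
        word
        for line in lines            # like flatMap: each line yields many tokens
        for word in tokenize(line)
        if word not in stop          # like filter: drop stop words
    )
    # like reduceByKey followed by sortBy(count, descending)
    return counts.most_common()


# Hypothetical sample data; str.split is a stand-in tokenizer.
sample = ["the fox and the dog", "the dog sleeps"]
result = word_counts(sample, str.split, ["the", "and"])
print(result[:100])
```

With jieba available, the tokenizer from the post could be passed instead, for example `lambda line: jieba.cut(line, cut_all=False)`. If this version runs correctly, the problem is more likely in how the Spark job finds the jieba module than in the word-count logic itself.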