I am very new to PySpark and am having trouble with PySpark DataFrames.
I'm trying to implement the TF-IDF algorithm. I did it once with a pandas DataFrame, but now that I've switched to PySpark, everything has changed :( I can't use a PySpark DataFrame the way I would a pandas one, e.g. dataframe['ColumnName']. When I run the code, it fails with "Column is not iterable".
This is a big problem for me and I haven't been able to solve it. The current problem is below:
With Pandas:
tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
tfidf.fit(pandasDF['name'])
tfidf_tran = tfidf.transform(pandasDF['name'])
With PySpark:
tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
tfidf.fit(SparkDF['name'])
tfidf_tran = tfidf.transform(SparkDF['name'])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19992/3734911517.py in <module>
13 vocabulary = list(vocabulary)
14 tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
---> 15 tfidf.fit(dataframe['name'])
16 tfidf_tran = tfidf.transform(dataframe['name'])
17
E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit(self, raw_documents, y)
1821 self._check_params()
1822 self._warn_for_unused_params()
-> 1823 X = super().fit_transform(raw_documents)
1824 self._tfidf.fit(X)
1825 return self
E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1200 max_features = self.max_features
1201
-> 1202 vocabulary, X = self._count_vocab(raw_documents,
1203 self.fixed_vocabulary_)
1204
E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
1110 values = _make_int_array()
1111 indptr.append(0)
-> 1112 for doc in raw_documents:
1113 feature_counter = {}
1114 for feature in analyze(doc):
E:\Anaconda\lib\site-packages\pyspark\sql\column.py in __iter__(self)
458
459 def __iter__(self):
--> 460 raise TypeError("Column is not iterable")
461
462 # string methods
TypeError: Column is not iterable
CodePudding user response:
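TF-IDF is the term frequency multiplied by the inverse document frequency. The error occurs because SparkDF['name'] is a Column expression, not the column's data, so scikit-learn's TfidfVectorizer cannot loop over it. PySpark's ML library (pyspark.ml.feature) has no single TF-IDF vectorizer for DataFrames, but it has two classes that together get you there. HashingTF produces the term frequencies from a column of tokenized text. IDF is then fitted on those term-frequency vectors, and the fitted model's transform() rescales them by the inverse document frequencies, so you don't have to multiply the two yourself. The result is one TF-IDF vector per row, comparable to what the TfidfVectorizer you specified originally would give. The numbers will not match exactly, though: HashingTF hashes terms to indices instead of using your fixed vocabulary, and Spark's IDF formula differs from scikit-learn's default, which also L2-normalizes each row.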