I am trying to calculate the 2-gram distribution and get the result back as a PySpark DataFrame. I am currently using:
new_df = sample_df_updated.select(['ngrams'])
from pyspark.sql.functions import explode
new_df.select(explode(new_df.ngrams)).show(truncate=False)
+------------------+
|col               |
+------------------+
|the project       |
|project gutenberg |
|gutenberg ebook   |
|ebook of          |
|of alice’s        |
|alice’s adventures|
|adventures in     |
|in wonderland,    |
|wonderland, by    |
|by lewis          |
|lewis carroll     |
|this ebook        |
|ebook is          |
|is for            |
|for the           |
|the use           |
|use of            |
|of anyone         |
|anyone anywhere   |
|anywhere at       |
+------------------+
I am trying to use code like this:
df2 = new_df.select(explode(new_df.ngrams)).show(truncate=False)
df2.groupBy('col').count().show()
But it results in the error
'NoneType' object has no attribute 'show'
How can I get this result as a DataFrame so that I can group and count it?
CodePudding user response:
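.show() only prints the DataFrame and returns None, so df2 ends up being None rather than a DataFrame. Assign the DataFrame first and call .show() on it separately: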
df2 = new_df.select(explode(new_df.ngrams))
df2.show(truncate=False)
df2.groupBy('col').count().show()
It could also be helpful to rename the exploded column for clarity:
df2 = new_df.select(explode(new_df.ngrams).alias('exploded_ngrams'))
df2.show(truncate=False)
df2.groupBy('exploded_ngrams').count().show()
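If you want to sanity-check the counts on a small sample, here is a rough plain-Python sketch of what explode followed by groupBy(...).count() computes. It uses collections.Counter instead of Spark, and the rows list is made-up sample data standing in for the ngrams column, so treat it as an illustration only, not as a replacement for the Spark code:

```python
from collections import Counter

# Hypothetical sample data: each inner list stands in for one row's
# `ngrams` array (not the real contents of sample_df_updated).
rows = [
    ["the project", "project gutenberg", "gutenberg ebook"],
    ["this ebook", "ebook is", "the project"],
]

# Plain-Python analogue of explode(): one item per array element.
exploded = [ngram for row in rows for ngram in row]

# Plain-Python analogue of groupBy(...).count(): tally each distinct 2-gram.
counts = Counter(exploded)

# Print the 2-grams from most to least frequent.
for ngram, n in counts.most_common():
    print(f"{ngram}: {n}")
```

The Spark result should contain one row per distinct 2-gram with the same kind of tally in its count column.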
CodePudding user response:
Why not just explode the column directly on the DataFrame?
new_df.withColumn("ngrams", explode("ngrams")).show(truncate=False)