How can I convert the exploded object back to a PySpark DataFrame?

I am calculating a 2-gram distribution and want to get the exploded result back as a PySpark DataFrame. This is what I am running:

from pyspark.sql.functions import explode

new_df = sample_df_updated.select(['ngrams'])
new_df.select(explode(new_df.ngrams)).show(truncate=False)

+------------------+
|col               |
+------------------+
|the project       |
|project gutenberg |
|gutenberg ebook   |
|ebook of          |
|of alice’s        |
|alice’s adventures|
|adventures in     |
|in wonderland,    |
|wonderland, by    |
|by lewis          |
|lewis carroll     |
|this ebook        |
|ebook is          |
|is for            |
|for the           |
|the use           |
|use of            |
|of anyone         |
|anyone anywhere   |
|anywhere at       |
+------------------+

To count the 2-grams, I tried code like this:

df2 = new_df.select(explode(new_df.ngrams)).show(truncate=False)
df2.groupBy('col').count().show()

But it fails with this error:

'NoneType' object has no attribute 'show'

How can I get the exploded result as a DataFrame?

CodePudding user response：

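.show() only prints the rows and returns None, so assigning its result leaves df2 as None instead of a DataFrame. Keep the DataFrame in the variable and call .show() separately.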

Try:

df2 = new_df.select(explode(new_df.ngrams))
df2.show(truncate=False)
df2.groupBy('col').count().show()

It can also help to rename the exploded column for clarity:

df2 = new_df.select(explode(new_df.ngrams).alias('exploded_ngrams'))
df2.show(truncate=False)
df2.groupBy('exploded_ngrams').count().show()
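
To make the None behaviour concrete without needing a Spark session, here is a minimal plain-Python sketch. TinyFrame and select_first are hypothetical stand-ins invented for illustration, not PySpark APIs; the only thing they mimic is that a display method prints and returns None, while a transformation returns a new object.

```python
class TinyFrame:
    """Hypothetical stand-in for a DataFrame; not part of PySpark."""

    def __init__(self, rows):
        self.rows = list(rows)

    def select_first(self, n):
        # A transformation returns a new frame, so it can be chained or stored.
        return TinyFrame(self.rows[:n])

    def show(self):
        # Like DataFrame.show(), this prints and implicitly returns None.
        for row in self.rows:
            print(row)


frame = TinyFrame(["the project", "project gutenberg", "gutenberg ebook"])

# Assigning the result of show() stores None, not a frame.
broken = frame.select_first(2).show()

# Keep the frame in the variable and display it in a separate step.
fixed = frame.select_first(2)
fixed.show()

print(broken is None)
print(isinstance(fixed, TinyFrame))
```

Calling any method on broken would raise an AttributeError about 'NoneType', which is the same kind of failure as in the question.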

CodePudding user response：

Why not just explode the column directly on the DataFrame?

new_df.withColumn("ngrams", explode("ngrams")).show(truncate=False)
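
As a rough picture of what explode followed by groupBy(...).count() computes, here is a plain-Python analogue using collections.Counter. The rows list is made-up sample data shaped like the ngrams column, and this is only a sketch of the logic, not Spark code.

```python
from collections import Counter

# Made-up sample: each row holds a list of 2-grams, like the ngrams column.
rows = [
    ["the project", "project gutenberg"],
    ["the project", "gutenberg ebook"],
]

# Analogue of explode: one output item per list element.
exploded = [ngram for ngrams in rows for ngram in ngrams]

# Analogue of groupBy(...).count(): occurrences of each distinct 2-gram.
counts = Counter(exploded)

for ngram, count in counts.most_common():
    print(ngram, count)
```

In Spark the same two steps stay distributed, so the DataFrame version is the one to use on the full text.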