I would like to replicate the Pandas nunique function with Spark SQL and DataFrame. I have the following:
%spark
import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions._
val df = spark.read
.format("csv")
.option("delimiter", ";")
.option("header", "true") //first line in file has headers
.load("target/youtube_videos.csv")
println("Distinct Count: " df.distinct().count())
val df2 = df.select(countDistinct("likes"))
df2.show(false)
This works and prints the unique count for the likes column as below:
Distinct Count: 109847
---------------------
|count(DISTINCT likes)|
---------------------
|27494 |
---------------------
How can I do this in one SQL so that I can get a summary of all the individual columns?
CodePudding user response:
You can easily iterate on all the columns within your DataFrame and collect your result , else you can convert it to a DataFrame as well
import pyspark.sql.functions as F
data = {"col1" : [np.random.randint(10) for x in range(1,10)],
"col2" : [np.random.randint(100) for x in range(1,10)]}
mypd = pd.DataFrame(data)
sparkDF = sql.createDataFrame(mypd)
sparkDF.show()
---- ----
|col1|col2|
---- ----
| 4| 54|
| 2| 90|
| 4| 70|
| 4| 37|
| 7| 63|
| 8| 59|
| 0| 52|
| 2| 76|
| 7| 79|
---- ----
Iterating DataFrame Columns
sparkDF.select([F.countDistinct(F.col(c)).alias(c) for c in sparkDF.columns]).show()
---- ----
|col1|col2|
---- ----
| 5| 9|
---- ----
CodePudding user response:
I was able to get the results like I wanted as below:
import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col
// Print the unique values for each column
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*).show(false)
The snippet above prints:
------------- ---------- ----------- -------- ----------------- ----- ----- -------- -------- -------- -------------------- -------------------- --------------
|channel_title|channel_id|video_title|video_id|video_upload_date|views|likes|dislikes|comments|age_days|likes_dislikes_ratio|comments_views_ratio|mean_views_day|
------------- ---------- ----------- -------- ----------------- ----- ----- -------- -------- -------- -------------------- -------------------- --------------
|127 |127 |108980 |109846 |109692 |86107|27494|5457 |7104 |5195 |69021 |100809 |109714 |
------------- ---------- ----------- -------- ----------------- ----- ----- -------- -------- -------- -------------------- -------------------- --------------