Grouping in pySpark Dataframes


I am using Spark DataFrames. The task is to count the cities in each country, grouped by country and region (subcountry), and display the counts in descending order.

Initial data:

from pyspark.sql.functions import col 
from pyspark.sql.functions import count
df = spark.read.json("/content/world-cities.json")
df.printSchema()
df.show()


Desired result: the city counts per (country, subcountry) pair, sorted in descending order.

My query groups only by the country column. How can I also group by the second column, subcountry?

df.groupBy(col('country')).agg(count("*").alias("cnt"))\
  .orderBy(col('cnt').desc())\
  .show()


CodePudding user response:

If I understand you correctly, you just need to add the second column to your groupBy:

import pyspark.sql.functions as F

x = [("USA", "usa-subcountry", "usa-city"),
     ("USA", "usa-subcountry", "usa-city-2"),
     ("USA", "usa-subcountry-2", "usa-city"),
     ("Argentina", "argentina-subcountry", "argentina-city")]
df = spark.createDataFrame(x, schema=['country', 'subcountry', 'city'])
df.groupBy(F.col('country'), F.col('subcountry')).agg(F.count("*").alias("cnt"))\
  .orderBy(F.col('cnt').desc())\
  .show()

Output is:

+---------+--------------------+---+
|  country|          subcountry|cnt|
+---------+--------------------+---+
|      USA|      usa-subcountry|  2|
|      USA|    usa-subcountry-2|  1|
|Argentina|argentina-subcountry|  1|
+---------+--------------------+---+

Edit: another attempt, based on the comment:

import pyspark.sql.functions as F

x = [("USA","usa-subcountry", "usa-city"),
     ("USA","usa-subcountry", "usa-city-2"),
     ("USA","usa-subcountry", "usa-city-3"),
     ("USA","usa-subcountry-2", "usa-city"),
     ("Argentina","argentina-subcountry", "argentina-city"),
     ("Argentina","argentina-subcountry-2", "argentina-city-2"),
     ("UK","UK-subcountry", "UK-city-1")]
df = spark.createDataFrame(x, schema=['country', 'subcountry', 'city'])

df.groupBy(F.col('country'), F.col('subcountry')).agg(F.count("*").alias("city_count"))\
  .groupBy(F.col('country')).agg(F.count("*").alias("subcountry_count"), F.sum('city_count').alias("city_count"))\
  .orderBy(F.col('city_count').desc())\
  .show()

Output:

+---------+----------------+----------+
|  country|subcountry_count|city_count|
+---------+----------------+----------+
|      USA|               2|         4|
|Argentina|               2|         2|
|       UK|               1|         1|
+---------+----------------+----------+

I am assuming that cities and subcountries are unique; if not, consider using countDistinct instead of count.
