I am using the code below to select columns from two tables. I am running Spark with Scala 2.11.11, and the code runs, but it only returns the package ID and the number of packages. I need the result set to also include first name and last name. What am I missing in my code?
import org.apache.spark.sql.functions._
import spark.implicits._
flData_csv
.toDF("packageId", "flId", "date", "to", "from")
customers_csv.toDF("packageId", "firstName", "lastName")
flData_csv
.join(customers_csv, Seq("packageId"))
.select("packageId", "count", "firstName", "lastName")
.withColumnRenamed("packageId", "Package ID").groupBy("Package ID").count()
.withColumnRenamed("count", "Number of Packages")
.filter(col("count") >= 20)
.withColumnRenamed("firstName", "First Name")
.withColumnRenamed("lastName", "Last Name")
.show(100)
CodePudding user response:
After reading your code, I notice that there is a .groupBy call after the packageId renaming. After .groupBy, you are left with only the group key(s) (Package ID in this case) plus whatever the aggregation produces; every other column, including firstName and lastName, is dropped. Adding firstName and lastName as group keys should solve your problem. Here is a sample (note that "count" cannot appear in the select before the aggregation creates it, and the filter on count has to run before the rename to "Number of Packages"):
flData_csv
.join(customers_csv, Seq("packageId"))
.select("packageId", "firstName", "lastName")
.withColumnRenamed("packageId", "Package ID")
.groupBy("Package ID", "firstName", "lastName").count()
.filter(col("count") >= 20)
.withColumnRenamed("count", "Number of Packages")
.withColumnRenamed("firstName", "First Name")
.withColumnRenamed("lastName", "Last Name")
.show(100)
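As an aside, you can get the same result with a single aggregation and skip the count rename entirely by aliasing inside agg. Here is a minimal sketch, assuming the same flData_csv and customers_csv DataFrames as above:

import org.apache.spark.sql.functions.{col, count}

flData_csv
.join(customers_csv, Seq("packageId"))
.groupBy("packageId", "firstName", "lastName")
// alias names the aggregate column directly, so no withColumnRenamed("count", ...) is needed
.agg(count("packageId").alias("Number of Packages"))
// the aggregate column already has its final name here, so filter on it directly
.filter(col("Number of Packages") >= 20)
.withColumnRenamed("packageId", "Package ID")
.withColumnRenamed("firstName", "First Name")
.withColumnRenamed("lastName", "Last Name")
.show(100)

Doing the display renames last also keeps the grouping itself on the original column names, which tends to be easier to read.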