Azure Apache Spark groupby clause throws an error

Time:11-22

I am following this section of a tutorial on Apache Spark from the Azure team. But when I try to use the groupBy function of a DataFrame, I get the following error:

Error:

NameError: name 'TripDistanceMiles' is not defined

Question: What is causing the error in the following code, and how can it be fixed?

NOTE: I know how to group these results using Spark SQL, as shown in a later section of the same tutorial. But I am interested in using groupBy on the DataFrame itself.

Details:

a) The following code correctly displays 100 rows with the column headers PassengerCount and TripDistanceMiles:

%%pyspark
df = spark.read.load('abfss://[email protected]/NYCTripSmall.parquet', format='parquet')
display(df.select("PassengerCount","TripDistanceMiles").limit(100))

b) But the following code does not group the records and throws the error shown above:

%%pyspark
df = spark.read.load('abfss://[email protected]/NYCTripSmall.parquet', format='parquet')
df = df.select("PassengerCount","TripDistanceMiles").limit(100)
display(df.groupBy("PassengerCount").sum(TripDistanceMiles).limit(100))

CodePudding user response:

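Put TripDistanceMiles in double quotes. Without quotes, Python treats TripDistanceMiles as a variable name, and since no such variable exists it raises the NameError before Spark is ever called. The sum function on grouped data expects column names as strings, like this: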

display(df.groupBy("PassengerCount").sum("TripDistanceMiles").limit(100))
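To see why the quotes matter, here is a minimal plain-Python sketch that needs no Spark at all. The helper sum_by and the sample rows are made up for illustration; sum_by is only a hypothetical stand-in for groupBy(...).sum(...), not a Spark API. It shows that Python evaluates the argument before the function runs, so the unquoted name fails no matter what the function does.

```python
from collections import defaultdict


def sum_by(rows, key, value):
    """Hypothetical stand-in for df.groupBy(key).sum(value); not a Spark API."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


# Made-up sample rows, for illustration only.
rows = [
    {"PassengerCount": 1, "TripDistanceMiles": 2.0},
    {"PassengerCount": 1, "TripDistanceMiles": 3.5},
    {"PassengerCount": 2, "TripDistanceMiles": 1.0},
]

# Unquoted: Python looks up a variable named TripDistanceMiles while
# building the argument list, so sum_by is never actually called.
try:
    sum_by(rows, "PassengerCount", TripDistanceMiles)
    error_message = None
except NameError as exc:
    error_message = str(exc)

# Quoted: the column name is passed as a string and resolved inside sum_by.
totals = sum_by(rows, "PassengerCount", "TripDistanceMiles")

print(error_message)
print(totals)
```

The same reasoning applies to the DataFrame call in the question: the string "TripDistanceMiles" is just data handed to Spark, while the bare name is a Python variable lookup.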