I am following this section of a tutorial on Apache Spark from the Azure team. But when I try to use the groupBy function of a DataFrame, I get the following error:
Error:
NameError: name 'TripDistanceMiles' is not defined
Question: What may be a cause of the error in the following code, and how can it be fixed?
NOTE: I know how to group these results using Spark SQL, as shown in a later section of the same tutorial. But I am interested in using the groupBy method on the DataFrame.
Details:
a) The following code correctly displays 100 rows with the column headers PassengerCount and TripDistanceMiles:
%%pyspark
df = spark.read.load('abfss://[email protected]/NYCTripSmall.parquet', format='parquet')
display(df.select("PassengerCount","TripDistanceMiles").limit(100))
b) But the following code does not group the records and throws the error shown above:
%%pyspark
df = spark.read.load('abfss://[email protected]/NYCTripSmall.parquet', format='parquet')
df = df.select("PassengerCount","TripDistanceMiles").limit(100)
display(df.groupBy("PassengerCount").sum(TripDistanceMiles).limit(100))
CodePudding user response:
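Try putting TripDistanceMiles in double quotes. Without the quotes, Python treats it as a variable name and raises the NameError before Spark ever sees the call; sum expects the column name as a string. Like: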
display(df.groupBy("PassengerCount").sum("TripDistanceMiles").limit(100))
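To see why the quotes matter, here is a minimal sketch in plain Python, with no Spark involved. The function total_of is a hypothetical stand-in for a method that takes a column name as a string; it is not part of PySpark. It only illustrates that Python resolves an unquoted name as a variable before the call is made.

```python
def total_of(column_name):
    # Hypothetical stand-in for a method that expects a column name as a string.
    return f"sum({column_name})"

# Unquoted: Python tries to look up a variable called TripDistanceMiles
# while evaluating the arguments, so total_of is never actually called.
try:
    total_of(TripDistanceMiles)
    error_message = None
except NameError as exc:
    error_message = str(exc)

# Quoted: the column name is passed as an ordinary string.
quoted_result = total_of("TripDistanceMiles")

print(error_message)
print(quoted_result)
```

The same reasoning applies to the failing cell in the question: the error comes from Python itself, not from Spark, which is why the message is a NameError rather than a Spark analysis error.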