Currently, I am working with PySpark to analyze some data. I have a CSV file with payroll data in it, and I want to know which job has the best pay. To do that I need the average pay per job title, and ideally the median as well.
The methods available for groupBy in PySpark are: agg, avg, count, max, mean, min, pivot, and sum.
When I try the .mean() method it looks like this:
mean_pay_data = reduced_data.groupBy("JOB_TITLE").mean("REGULAR_PAY")
mean_pay_data.show(3)
# +--------------------+-----------------+
# |           JOB_TITLE| avg(REGULAR_PAY)|
# +--------------------+-----------------+
# |SENIOR SECURITY O...|59818.79285751433|
# |SENIOR TRAFFIC SU...| 72116.8394540951|
# |AIR CONDITIONING ...|98415.21726190476|
# +--------------------+-----------------+
Here is what it looks like with the .avg() method:
average_pay_data = reduced_data.groupBy("JOB_TITLE").avg("REGULAR_PAY")
average_pay_data.show(3)
# +--------------------+-----------------+
# |           JOB_TITLE| avg(REGULAR_PAY)|
# +--------------------+-----------------+
# |SENIOR SECURITY O...|59818.79285751433|
# |SENIOR TRAFFIC SU...| 72116.8394540951|
# |AIR CONDITIONING ...|98415.21726190476|
# +--------------------+-----------------+
They return the exact same values. What's the difference between mean() and avg()?
I also want to find the median, so that one highly paid person doesn't have too much of an impact on the result. Since there is no median() method in PySpark, I don't know what to do here.
CodePudding user response:
The documentation for both avg and mean says the same thing:
mean() is an alias for avg()
Both of these functions are identical. Both names exist so that developers coming from different backgrounds feel at home: SQL uses AVG, while pandas and most statistics libraries use mean.
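The same aliasing applies to the functions module, so the two aggregations below are interchangeable (a minimal sketch using the reduced_data DataFrame from your question):
from pyspark.sql import functions as F

# mean() delegates to avg(), so both produce the same result
# and even the same default column name, avg(REGULAR_PAY).
reduced_data.groupBy('JOB_TITLE').agg(F.avg('REGULAR_PAY')).show(3)
reduced_data.groupBy('JOB_TITLE').agg(F.mean('REGULAR_PAY')).show(3)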
Regarding the median:
Approximate (efficient) median:
F.expr('percentile_approx(col_name, .5) over()')
Accurate (inefficient) median:
F.expr('percentile(col_name, .5) over()')
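Note that both expressions use an empty over() clause, so they compute a single median across the whole DataFrame. To get a per-job median that matches the groupBy in your question, the same SQL functions can be used inside agg. A minimal sketch, assuming the reduced_data DataFrame and column names from the question:
from pyspark.sql import functions as F

# Approximate median of REGULAR_PAY per job title; fast and scalable.
median_pay_data = reduced_data.groupBy('JOB_TITLE').agg(
    F.expr('percentile_approx(REGULAR_PAY, 0.5)').alias('median_pay')
)
median_pay_data.show(3)

# Exact median: swap in percentile, which is accurate but more expensive,
# since it materializes all values of each group.
exact_median_data = reduced_data.groupBy('JOB_TITLE').agg(
    F.expr('percentile(REGULAR_PAY, 0.5)').alias('median_pay')
)
On Spark 3.1+, percentile_approx is also exposed directly as F.percentile_approx('REGULAR_PAY', 0.5), so the expr wrapper isn't needed there.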