a is my PySpark DataFrame, with a column "prsn" and others such as "x", "y", etc.
I am working on Spark 3 and applying the following command:
z = a.groupBy('x').agg(F.count('prsn').alias('b'))
This call is throwing a TypeError saying count() takes 1 argument but 2 were given. But I am only passing 1 argument here. Why is it being treated as 2?
CodePudding user response:
It's not directly obvious, but when count resolves to a DataFrame or GroupedData method instead of pyspark.sql.functions.count, Python automatically injects the object itself as the first argument (even when you don't pass it).
In that case, the error you are getting indicates that you should not pass any argument at all. If you reason about that count
in general, you are not counting a specific column; you are counting how many rows each group (or the entire DataFrame) has. Hence, whether you are counting that column or another column doesn't matter :).
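A minimal sketch of how that hidden argument can show up, assuming count is resolving to the GroupedData method rather than pyspark.sql.functions.count (the DataFrame a below mirrors the one in the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
a = spark.createDataFrame([('x', 'p1'), ('x', 'p2'), ('x', 'p3')], ['x', 'prsn'])

grouped = a.groupBy('x')

# GroupedData.count is a bound method: Python passes the GroupedData object
# itself as the first argument, so adding a column name makes two arguments.
try:
    grouped.count('prsn')
except TypeError as e:
    print(e)  # count() takes 1 positional argument but 2 were given

# With no argument it simply counts the rows in each group.
grouped.count().show()
# +---+-----+
# |  x|count|
# +---+-----+
# |  x|    3|
# +---+-----+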
CodePudding user response:
I tried replicating your issue, but I couldn't. The following works well:
from pyspark.sql import functions as F
a = spark.createDataFrame([('x', 'p1'), ('x', 'p2'), ('x', 'p3')], ['x', 'prsn'])
a.groupBy('x').agg(F.count('prsn').alias('b')).show()
# +---+---+
# |  x|  b|
# +---+---+
# |  x|  3|
# +---+---+
It probably depends on how your DataFrame was created, so more details would be needed to answer precisely.
You may try providing F.lit(1) to the count function:
a.groupBy('x').agg(F.count(F.lit(1)).alias('b')).show()
# +---+---+
# |  x|  b|
# +---+---+
# |  x|  3|
# +---+---+
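One thing to keep in mind with that variant: F.count('prsn') skips rows where prsn is null, while F.count(F.lit(1)) counts every row in the group. A small sketch, reusing the spark session and F import from above (the DataFrame with a null is made up for illustration):
# A hypothetical frame where one prsn value is null.
b = spark.createDataFrame([('x', 'p1'), ('x', None)], ['x', 'prsn'])
b.groupBy('x').agg(
    F.count('prsn').alias('non_null'),   # ignores nulls in prsn
    F.count(F.lit(1)).alias('all_rows'), # counts every row in the group
).show()
# +---+--------+--------+
# |  x|non_null|all_rows|
# +---+--------+--------+
# |  x|       1|       2|
# +---+--------+--------+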