I'm running the following block and I'm wondering why .alias
is not working:
data = [(1, "siva", 100), (2, "siva", 200), (3, "siva", 300),
        (4, "siva4", 400), (5, "siva5", 500)]
schema = ['id', 'name', 'sallary']
df = spark.createDataFrame(data, schema=schema)
df.show()
display(df.select('name').groupby('name').count().alias('test'))
Is there a specific reason? In which cases is .alias()
supposed to work in a situation like this? Also, why is no error being raised?
CodePudding user response:
You can change the syntax a bit to apply the alias without issue:
from pyspark.sql import functions as F
df.select('name').groupby('name').agg(F.count("name").alias("test")).show()
# output
+-----+----+
| name|test|
+-----+----+
|siva4|   1|
|siva5|   1|
| siva|   3|
+-----+----+
As for why there is no error: .count()
on grouped data returns an entire DataFrame, not a column, so .alias('test')
resolves to DataFrame.alias. That call assigns a name to the DataFrame itself (useful for qualifying columns in joins) rather than renaming the count column, which is why it succeeds silently without changing the column name.