Count rows of a dataset by the values of a column in PySpark

Time:07-27

I'm working with PySpark. I have a dataset like this:

I want to count the rows of my dataset grouped by the values in my "Column3" column. For example, I want a result with one row per distinct "Column3" value, together with the number of rows that have that value.

CodePudding user response:

from pyspark.sql.functions import count

# Sample data standing in for the dataset in the question
temp = spark.createDataFrame([
    (0, 11, 'A'),
    (1, 12, 'B'),
    (2, 13, 'B'),
    (0, 14, 'A'),
    (1, 15, 'c'),
    (2, 16, 'A'),
], ["column1", "column2", "column3"])

# Count the rows for each distinct value of column3
temp.groupBy('column3').agg(count('*').alias('count')).sort('column3').show(10, False)
# +-------+-----+
# |column3|count|
# +-------+-----+
# |A      |3    |
# |B      |2    |
# |c      |1    |
# +-------+-----+

CodePudding user response:

# Use the column name from your own dataset ("Column3" in the question)
df.groupBy('Column3').count().show()