PySpark: How to use a variable as the denominator within a multi-line statement

Time:11-07

I am writing a multi-line statement in PySpark. I have a DataFrame 'current', which I group by 'major' and aggregate into a new column, 'n_students', that counts the students in each major. I would then like to create another column, 'prop', that divides each major's n_students by the total number of students. That total is stored in the variable current_students and is currently 2055. In the statement below I have hard-coded 2055 as the denominator. How do I change the denominator to use the value of current_students instead?

current_students=current.count()
print(current_students)
2055 


from pyspark.sql.functions import expr

(
    current
    .groupBy('major')
    .agg(
        expr('count(*) AS n_students')
    )
    .select(
        'major', 'n_students',
        # 2055 is hard-coded here; this is the value to replace
        expr('ROUND(n_students/2055,4) AS prop')
    )
    .sort('prop', ascending=False)
    .show()
)
+-----+----------+------+
|major|n_students|  prop|
+-----+----------+------+
|  BIO|       615|0.2993|
|  CSC|       508|0.2472|
|  CHM|       405|0.1971|
|  MTH|       320|0.1557|
|  PHY|       207|0.1007|
+-----+----------+------+

I would like to get this exact output, but with the denominator taken from the variable current_students rather than the hard-coded number 2055.

current_students=current.count()

(
    current
    .groupBy('major')
    .agg(
        expr('count(*) AS n_students')
    )
    .select(
        'major', 'n_students',
        expr('ROUND(n_students/##CHANGE TO PULL FROM VARIABLE Current_students##,4) AS prop')
    )
    .sort('prop', ascending=False)
    .show()
)
+-----+----------+------+
|major|n_students|  prop|
+-----+----------+------+
|  BIO|       615|0.2993|
|  CSC|       508|0.2472|
|  CHM|       405|0.1971|
|  MTH|       320|0.1557|
|  PHY|       207|0.1007|
+-----+----------+------+

CodePudding user response:

Use Python's string format() method to insert the variable's value into the expression string.

from pyspark.sql import functions as func

current_students = current.count()

func.expr('ROUND(n_students/{0}, 4) AS prop'.format(current_students))
# Column<'ROUND((n_students / 2055), 4) AS `prop`'>
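
The same substitution can be written with an f-string. The sketch below only builds the expression text, so it runs without a Spark session. It assumes current_students holds the integer returned by current.count(); the literal 2055 is a stand-in taken from the question, and the name prop_sql is made up for illustration.

```python
# Stand-in for current.count(); 2055 is the total reported in the question.
current_students = 2055

# Build the SQL expression text first; this string is what would be passed to expr().
prop_sql = f"ROUND(n_students / {current_students}, 4) AS prop"

print(prop_sql)
# ROUND(n_students / 2055, 4) AS prop
```

Interpolating the value into SQL text is reasonable here because it is an integer produced by count(). For values that come from user input, keeping the value out of the SQL string is the safer habit.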

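You could also use the native func.col and func.lit functions instead of expr. Note that the column has to be wrapped in func.col; a bare 'n_students' string would be treated as a literal string, not as a column reference.

func.round(func.col('n_students') / func.lit(current_students), 4).alias('prop')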
# Column<'round((n_students / 2055), 4) AS `prop`'>
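
As a quick sanity check that needs no Spark session, the counts from the question's output can be divided by their own sum in plain Python. This is only a sketch: the counts dict is copied by hand from the question's table, and Python's round stands in for Spark's ROUND (the two can differ on exact ties, which do not occur in this data).

```python
# Per-major counts copied from the question's output table.
counts = {"BIO": 615, "CSC": 508, "CHM": 405, "MTH": 320, "PHY": 207}

# Plays the role of current.count(): the total number of students.
current_students = sum(counts.values())

# Same arithmetic as ROUND(n_students / current_students, 4).
props = {major: round(n / current_students, 4) for major, n in counts.items()}

for major, prop in props.items():
    print(major, counts[major], prop)
```

The counts sum to 2055, and the resulting proportions match the prop column shown in the question, so swapping the hard-coded number for the variable leaves the output unchanged.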