I am writing a multi-line statement in PySpark. I have a DataFrame 'current', which I group by 'major' and aggregate into a new column, n_students, that counts the students in each major. I then want to add another column, prop, that divides n_students for each major by the total number of students. That total is stored in the variable current_students and is currently 2055. In the statement below I have hard-coded 2055 as the denominator. How do I change the denominator to use the value of current_students instead?
current_students=current.count()
print(current_students)
2055
(
    current
    .groupBy('major')
    .agg(
        expr('count(*) AS n_students')
    )
    .select(
        'major', 'n_students',
        expr('ROUND(n_students/2055,4) AS prop')
    )
    .sort('prop', ascending=False)
    .show()
)
+-----+----------+------+
|major|n_students|  prop|
+-----+----------+------+
|  BIO|       615|0.2993|
|  CSC|       508|0.2472|
|  CHM|       405|0.1971|
|  MTH|       320|0.1557|
|  PHY|       207|0.1007|
+-----+----------+------+
I would like to get this exact output, but with the denominator taken from the variable current_students rather than the hard-coded 2055.
current_students=current.count()
(
    current
    .groupBy('major')
    .agg(
        expr('count(*) AS n_students')
    )
    .select(
        'major', 'n_students',
        expr('ROUND(n_students/##CHANGE TO PULL FROM VARIABLE Current_students##,4) AS prop')
    )
    .sort('prop', ascending=False)
    .show()
)
+-----+----------+------+
|major|n_students|  prop|
+-----+----------+------+
|  BIO|       615|0.2993|
|  CSC|       508|0.2472|
|  CHM|       405|0.1971|
|  MTH|       320|0.1557|
|  PHY|       207|0.1007|
+-----+----------+------+
CodePudding user response:
Use Python's string format() method to insert a variable's value into the expression string.
from pyspark.sql import functions as func  # 'func' is the alias used in this answer
current_students = current.count()
func.expr('ROUND(n_students/{0}, 4) AS prop'.format(current_students))
# Column<'ROUND((n_students / 2055), 4) AS `prop`'>
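The same substitution can be written with an f-string. The snippet below is a minimal sketch that only builds the SQL text, so it runs without Spark; the value 2055 is a stand-in for current.count(), and prop_sql is a name introduced here for illustration.

```python
# Stand-in for current.count(); in the real job this comes from the DataFrame.
current_students = 2055

# Build the SQL text with an f-string, equivalent to the str.format() call.
prop_sql = f"ROUND(n_students / {current_students}, 4) AS prop"
print(prop_sql)
```

The resulting string would then be passed to expr() inside the select, in place of the hard-coded expression. Interpolating is reasonable here because the value is an integer returned by count(); for arbitrary user-supplied strings it would be safer to avoid building SQL text this way.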
You could also use the native func.col and func.lit instead of expr:
# func.col is needed so that n_students is read as a column, not a plain Python string
func.round((func.col('n_students') / func.lit(current_students)), 4).alias('prop')
# Column<'round((n_students / 2055), 4) AS `prop`'>
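As a quick cross-check that dividing by the variable reproduces the table in the question, the proportions can be recomputed in plain Python from the counts shown there. This is only a sketch: the counts dictionary is copied by hand from the question's output, and it assumes those five majors cover every row, so that their sum matches current.count().

```python
# Per-major counts copied from the question's output table.
counts = {"BIO": 615, "CSC": 508, "CHM": 405, "MTH": 320, "PHY": 207}

# Assumption: these majors cover all rows, so the sum equals current.count().
current_students = sum(counts.values())

# Same arithmetic as ROUND(n_students / current_students, 4).
props = {major: round(n / current_students, 4) for major, n in counts.items()}

# Print in descending order of proportion, like .sort('prop', ascending=False).
for major, prop in sorted(props.items(), key=lambda kv: kv[1], reverse=True):
    print(major, prop)
```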