Does anyone know how I can do these calculations in PySpark?
import pandas as pd

data = {
    'Name': ['Tom', 'nick', 'krish', 'jack'],
    'Age': [20, 21, 19, 18],
    'CSP': [2, 6, 8, 7],
    'coef': [2, 2, 3, 3]
}
# Create DataFrame
df = pd.DataFrame(data)
colsToRecalculate = ['Age', 'CSP']
# Divide each listed column by the 'coef' column
for col in colsToRecalculate:
    df[col] = df[col] / df["coef"]
CodePudding user response:
You can use select() on a Spark DataFrame and pass several column expressions (each with its own calculation) as arguments. In your case:
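from pyspark.sql import functions as F

df2 = spark.createDataFrame(pd.DataFrame(data))  # assumes an active SparkSession named `spark`
# Only the columns passed to select() are kept, so 'Name' is dropped here
df2.select(*[(F.col(c) / F.col('coef')).alias(c) for c in colsToRecalculate], 'coef').show()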
CodePudding user response:
A slight variation on bzu's answer, which lists the untouched columns manually within the select. We can instead iterate over dataframe.columns and check each column against the colsToRecalculate list: if the column is in the list, do the calculation; otherwise leave the column as is.
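from pyspark.sql import functions as func

# data_sdf: the Spark DataFrame built from the question's data
data_sdf.select(*[(func.col(k) / func.col('coef')).alias(k) if k in colsToRecalculate else k for k in data_sdf.columns])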
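If you want to sanity-check the numbers without a Spark session, the same "divide if listed, otherwise keep as is" rule can be modelled in plain Python. This is only a sketch of the logic, not PySpark code: `recalculate` is a hypothetical helper made up for illustration, and the data is copied from the question.

```python
# Plain-Python model of the conditional select above (no Spark required).
# `recalculate` is a hypothetical helper for illustration, not a PySpark API.
data = {
    'Name': ['Tom', 'nick', 'krish', 'jack'],
    'Age': [20, 21, 19, 18],
    'CSP': [2, 6, 8, 7],
    'coef': [2, 2, 3, 3],
}
colsToRecalculate = ['Age', 'CSP']


def recalculate(columns, cols_to_divide, divisor='coef'):
    """Divide the listed columns element-wise by `divisor`; copy the rest unchanged."""
    return {
        name: (
            [v / d for v, d in zip(values, columns[divisor])]
            if name in cols_to_divide
            else list(values)
        )
        for name, values in columns.items()
    }


result = recalculate(data, colsToRecalculate)
for name, values in result.items():
    print(name, values)
```

Column order is preserved and Name and coef pass through untouched, which is the behaviour the data_sdf.columns version should give you in Spark as well.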