pyspark: for loop calculations over the columns


Does anyone know how I can do these calculations in PySpark?

import pandas as pd

data = {
    'Name': ['Tom', 'nick', 'krish', 'jack'],
    'Age': [20, 21, 19, 18],
    'CSP': [2, 6, 8, 7],
    'coef': [2, 2, 3, 3]
}

# Create DataFrame
df = pd.DataFrame(data)
colsToRecalculate = ['Age', 'CSP']

# Divide each listed column by the 'coef' column
for col in colsToRecalculate:
    df[col] = df[col] / df["coef"]

CodePudding user response:

You can use select() on a Spark DataFrame and pass multiple columns (each with its own calculation) as parameters. In your case:

from pyspark.sql import functions as F

df2 = spark.createDataFrame(pd.DataFrame(data))
df2.select(*[(F.col(c) / F.col('coef')).alias(c) for c in colsToRecalculate], 'coef').show()
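
For the sample data in the question, this should print something like:

+-----------------+------------------+----+
|              Age|               CSP|coef|
+-----------------+------------------+----+
|             10.0|               1.0|   2|
|             10.5|               3.0|   2|
|6.333333333333333|2.6666666666666665|   3|
|              6.0|2.3333333333333335|   3|
+-----------------+------------------+----+

Note that Name is not in the select list, so it is dropped; add it alongside 'coef' if you need to keep it.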

CodePudding user response:

A slight variation on bzu's answer that avoids listing the non-recalculated columns manually inside the select. We can iterate over dataframe.columns and check each column against the colsToRecalculate list: if the column is in the list, apply the calculation; otherwise keep the column as is. This way every column of the DataFrame, including Name, is retained.

from pyspark.sql import functions as func

data_sdf.select(*[(func.col(k) / func.col('coef')).alias(k) if k in colsToRecalculate else k
                  for k in data_sdf.columns])
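
If you would rather keep the loop structure of the original pandas code, chained withColumn calls give the same result. Below is a minimal, self-contained sketch; the SparkSession setup is an assumption, and the data dict and colsToRecalculate list are copied from the question:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = {
    'Name': ['Tom', 'nick', 'krish', 'jack'],
    'Age': [20, 21, 19, 18],
    'CSP': [2, 6, 8, 7],
    'coef': [2, 2, 3, 3]
}
colsToRecalculate = ['Age', 'CSP']

sdf = spark.createDataFrame(pd.DataFrame(data))

# Replace each listed column with its value divided by 'coef',
# mirroring the pandas for loop in the question
for c in colsToRecalculate:
    sdf = sdf.withColumn(c, F.col(c) / F.col('coef'))

sdf.show()

Each withColumn call adds a projection to the plan, so for a long list of columns the single select() shown in the answers above is generally the better choice.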