How to loop over a RDD by index in Pyspark and replace column values?


I am trying to accomplish with an RDD or Spark DataFrame in PySpark what I have done with a regular pandas DataFrame below. I tried to solve it with the foreach() function, but all my attempts failed. Does somebody have a neat solution for it?

# pandas: flip positive loudness values to negative, row by row
for i in range(len(all_songs)):
    if all_songs['loudness'][i] > 0:
        loudness = all_songs.loc[i, 'loudness']
        all_songs.loc[i, 'loudness'] = loudness * -1

Thank you very much!

CodePudding user response:

I am not sure whether a pure DataFrame API solution is valid for your case, but I would achieve what you describe with the following code:

from pyspark.sql.functions import when, col

# Assume that `df` is your DataFrame
replaced_df = df.withColumn(
    "loudness",
    when(col("loudness") > 0, col("loudness") * -1).otherwise(col("loudness"))
)
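
Note that the otherwise() clause is needed here; without it, rows where loudness <= 0 would be set to null instead of being left unchanged.

If you specifically need an RDD-level solution, keep in mind that RDDs are immutable, so there is no in-place update and foreach() cannot modify elements; the idiomatic replacement for an index-based loop is a map() transformation that returns new rows. A minimal sketch, assuming df is your DataFrame (so df.rdd contains Row objects) and spark is your SparkSession:

from pyspark.sql import Row

def flip_loudness(row):
    # Build a plain dict from the Row, then negate positive loudness values
    d = row.asDict()
    if d["loudness"] > 0:
        d["loudness"] = -d["loudness"]
    return Row(**d)

replaced_rdd = df.rdd.map(flip_loudness)

# Convert back to a DataFrame if needed
replaced_df = spark.createDataFrame(replaced_rdd)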