Spark WithColumnRenamed isn't working in for loop

Time:06-16

I have been working with PySpark for years and I have never encountered behaviour this odd:

I have a bunch of dataframes, let's call them df1, df2 and df3.

I want to rename two of their columns identically.

So I created the following function:

def RenameColumns(df):
  return df.withColumnRenamed("A", "AA").withColumnRenamed("B", "BB")

And I wrote the following code next:

l = [df1, df2, df3]
for df in l:
  df = RenameColumns(df)

When I display my dataframes, I still see the old column names, as if RenameColumns had never executed at all.

Replacing my loop with:

df1 = RenameColumns(df1)
df2 = RenameColumns(df2)
df3 = RenameColumns(df3)

works.

Can anyone tell me what the problem is? I also tried:

def RenameColumns(l):
  for df in l:
    df = df.withColumnRenamed("A", "AA").withColumnRenamed("B", "BB")

l = [df1, df2, df3]
RenameColumns(l)

And same thing: it doesn't rename my columns.

CodePudding user response:

The assignment `df = RenameColumns(df)` only rebinds the local loop variable `df` to a new dataframe; it never touches the list or the original df1, df2, df3. If you want the loop to work, you need to reassign the dataframe back into the list at the right position on each iteration. However, mutating the list you're looping over is generally not advised.

You can achieve what you want more easily with a list comprehension:

l = [RenameColumns(df) for df in l]
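The rebinding behaviour can be reproduced without Spark at all; here is a minimal plain-Python sketch, with strings standing in for the dataframes and upper-casing standing in for RenameColumns:

```python
def rename(s):
    # Stand-in for RenameColumns: returns a NEW object, like
    # withColumnRenamed returns a new DataFrame.
    return s.upper()

l = ["df1", "df2", "df3"]

# Rebinding the loop variable: the list is untouched afterwards.
for s in l:
    s = rename(s)
# l is still ["df1", "df2", "df3"]

# Assigning by index updates the list itself.
for i, s in enumerate(l):
    l[i] = rename(s)
# l is now ["DF1", "DF2", "DF3"]
```

The same distinction applies to PySpark dataframes: since every transformation returns a new dataframe, only the index assignment (or a comprehension) keeps the results.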

CodePudding user response:

What you are trying can work, but in your second attempt the function has no return statement, so the renamed dataframes are simply discarded. And even inside the loop, `df = ...` only rebinds the loop variable; it never overwrites df1, df2 or df3.

In such a case I would store the renamed dataframes in a list or dict and look them up from there. I prefer a dict because it's easier to look up by the original dataframe. Let's try:

def rename_df(df):
    return df.toDF('AA', 'BB')

lst = [df1, df2, df3]
l = dict(zip(lst, [rename_df(x) for x in lst]))

l.get(df1).show()
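One caveat: using the dataframes themselves as dict keys only works because DataFrame falls back to identity hashing; keying by a name string is usually clearer. A plain-Python sketch of that pattern (strings standing in for dataframes, upper-casing standing in for rename_df, and the names "df1"/"df2" are hypothetical):

```python
def rename_df(s):
    # Stand-in for the Spark rename: returns a new object.
    return s.upper()

# Original dataframes, keyed by a human-readable name.
frames = {"df1": "a b", "df2": "c d"}

# Build a dict mapping each name to its renamed version.
renamed = {name: rename_df(df) for name, df in frames.items()}

renamed["df1"]  # look up the renamed df1 by name
```

With real dataframes the shape is the same: `renamed = {"df1": rename_df(df1), "df2": rename_df(df2)}`, and each renamed frame is fetched by its original name.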