I want to write a function to manipulate a spark-dataframe column value


Code to create dataframe:

source_df = spark.createDataFrame(
    [
        ("Jose", "BLUE"),
        ("lI", "BrOwN")
    ],
    ["name", "eye_color"]
)

I have written the following code to convert the 'eye_color' column to lowercase:

actual_df = source_df
for col_name in actual_df.columns if column == 'eye_color' else column for column in actual_df.columns:
    actual_df = actual_df.withColumn(col_name, lower(col(col_name)))

I am getting following error:

Cell In [26], line 2
  for col_name in actual_df.columns if column == 'eye_color' else column for column in actual_df.columns:
                                                                           ^
SyntaxError: invalid syntax

CodePudding user response:

This is more a Python problem than a Spark problem: your Python syntax is invalid.
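
For reference, the two valid comprehension forms look like this: a trailing "if" acts as a filter, while an "if ... else" ternary belongs in the expression position at the front. A minimal illustration in plain Python (no Spark involved):

columns = ['name', 'eye_color']

# trailing if: filters, keeping only matching items
[c for c in columns if c == 'eye_color']                  # ['eye_color']

# ternary at the front: transforms every item conditionally
[c.upper() if c == 'eye_color' else c for c in columns]   # ['name', 'EYE_COLOR']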

If you want to keep the same structure, that is, apply a transformation to each column that matches some criterion, there are multiple ways to do it:

from pyspark.sql.functions import col, lower

# using an if
for col_name in actual_df.columns:
    if col_name == 'eye_color':
        actual_df = actual_df.withColumn(col_name, lower(col(col_name)))

# using filter
for col_name in filter(lambda column: column == 'eye_color', actual_df.columns):
    actual_df = actual_df.withColumn(col_name, lower(col(col_name)))

# using list comprehension
for col_name in [column for column in actual_df.columns if column == 'eye_color']:
    actual_df = actual_df.withColumn(col_name, lower(col(col_name)))
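
As a side note beyond the original answer, the same conditional logic can also be written as a single select that rebuilds the projection in one pass, instead of calling withColumn in a loop. A sketch, assuming the same actual_df and imports as above:

actual_df = actual_df.select(
    *[lower(col(c)).alias(c) if c == 'eye_color' else col(c)
      for c in actual_df.columns]
)

This also happens to use the ternary-in-expression-position form correctly.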

But in your situation, as mentioned in one of the comments, since you only transform a single column, I would not use a loop at all. A single withColumn does the trick:

source_df.withColumn('eye_color', lower(col('eye_color')))
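
With the sample data above, a quick show() should confirm the result (output sketched from the input rows, so treat the exact padding as approximate):

source_df.withColumn('eye_color', lower(col('eye_color'))).show()

# +----+---------+
# |name|eye_color|
# +----+---------+
# |Jose|     blue|
# |  lI|    brown|
# +----+---------+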