How to create a new column in PySpark based on a dynamic condition

Time:06-19

I need to create a new column in a PySpark DataFrame. However, the expression used to create this column will be dynamic.

Example:

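from pyspark.sql.functions import substring, to_date

# Parse the last 8 characters of update_date_string as a date
df = df.withColumn(
    'update_date',
    to_date(
        substring(df['update_date_string'], -8, 8),
        'MM-dd-yy',
    ),
)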

This needs to be converted to a version where the expression is supplied as a string:

column_expression = "to_date(
                    substring(df['update_date_string'], -8, 8),
                    'MM-dd-yy',
                )"
df = df.withColumn(
                'update_date',
                expr(column_expression )
            )

The second version, which uses expr(), does not create the new column. How can this be resolved?

CodePudding user response:

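expr() takes a SQL expression string, not Python code, so the column has to be referenced by its name (update_date_string) rather than as df['update_date_string'] (Docs: https://sparkbyexamples.com/pyspark/pyspark-sql-expr-expression-function/). Try: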

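from pyspark.sql.functions import expr

# SQL expression held in a single-line string (a plain "..." literal cannot span lines);
# the column is referenced by name, not through df[...]
column_expression = "to_date(substring(update_date_string, -8, 8), 'MM-dd-yy')"
df = df.withColumn(
    'update_date',
    expr(column_expression),
)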
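
Since the expression is just a string, the "dynamic" part can be handled by building that string at runtime. The following is a minimal sketch of that idea, not a definitive implementation: the helper names build_date_expr and add_dynamic_column are hypothetical (they are not part of PySpark), and the column name, substring length and date format are taken from the question.

```python
def build_date_expr(column, length=8, fmt="MM-dd-yy"):
    """Build a Spark SQL expression string that parses the last `length`
    characters of `column` as a date. Hypothetical helper for illustration."""
    # Basic guard: only accept a plain identifier, because the name is
    # interpolated directly into SQL text.
    if not column.isidentifier():
        raise ValueError(f"unexpected column name: {column!r}")
    return f"to_date(substring({column}, -{length}, {length}), '{fmt}')"


def add_dynamic_column(df, new_column, sql_expression):
    """Add `new_column` to a PySpark DataFrame from a SQL expression string.
    Hypothetical helper; pyspark is imported lazily so that the string
    builder above can be used and tested without a Spark installation."""
    from pyspark.sql.functions import expr
    return df.withColumn(new_column, expr(sql_expression))


# Build the same expression as in the answer above
column_expression = build_date_expr("update_date_string")
print(column_expression)
```

With an existing DataFrame, the call would then be df = add_dynamic_column(df, 'update_date', column_expression). Keeping the string construction separate from the Spark call makes the generated expression easy to inspect or log before it is run.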