How to convert a pyspark column(pyspark.sql.column.Column) to pyspark dataframe?


I have a use case where I need to map the elements of a PySpark column based on a condition. Going through the documentation for pyspark.sql.column.Column, I could not find a map function on the column itself.

So I tried to use the PySpark DataFrame map function instead, but I have not been able to convert the PySpark column into a DataFrame.

Note: the reason I am working with a PySpark column is that I receive it as input from a library (Great Expectations) that I use.

@column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
    return column.isin([3])
    # need to replace the above logic with a map function
    # like column.map(lambda x: __valid_date(x))

The _spark function's arguments are passed in by the library.

What I have:

A PySpark column with timestamp strings

What I require:

A PySpark column with a boolean (True/False) for each element, based on validating each timestamp string's format

Example for a DataFrame:

df.rdd.map(lambda x: __valid_date(x)).toDF()

The __valid_date function returns True/False.
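
For reference, a validator of that shape might look like the following hypothetical sketch (the real __valid_date and the expected ts_formats live in my code; this stand-in just illustrates the True/False contract):

from datetime import datetime

# Hypothetical stand-in for the validator: True if the string parses
# with any of the expected timestamp formats, False otherwise.
def __valid_date(value, ts_formats=("%Y-%m-%d %H:%M:%S",)):
    if value is None:
        return False
    for fmt in ts_formats:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False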

So I either need to convert the PySpark column into a DataFrame so I can use the map function above, or find a map-like function that works directly on a PySpark column. Is there one?

CodePudding user response:

It looks like you need to return a Column object that the framework will use for validation. I have not used Great Expectations, but you could define a UDF to transform your column. Something like this:

import pyspark.sql.functions as F
import pyspark.sql.types as T

# Wrap the element-wise validator in a UDF that returns a boolean per row
valid_date_udf = F.udf(lambda x: __valid_date(x), T.BooleanType())

@column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
    return valid_date_udf(column)
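
If you want to sanity-check the UDF outside of the framework first, a minimal sketch could look like this (the sample data and the ts column name are made up for the demo, and __valid_date is the validator sketched earlier in the question):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2021-11-20 10:00:00",), ("not-a-timestamp",)], ["ts"]
)

# __valid_date is the validator sketched earlier in the question
valid_date_udf = F.udf(lambda x: __valid_date(x), T.BooleanType())

# The UDF maps every element of the column to a boolean,
# which is the kind of Column the condition partial must return.
df.withColumn("is_valid", valid_date_udf(F.col("ts"))).show()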