How to convert a pyspark column(pyspark.sql.column.Column) to pyspark dataframe?-CodePudding

I have an use case to map the elements of a pyspark column based on a condition. Going through this documentation pyspark column, i could not find a function for pyspark column to execute map function.

So tried to use the pyspark dataFrame map function, but not being able to convert the pyspark column to a dataframe

Note: The reason i am using the pyspark column is because i get that as an input from a library(Great expectations) which i use.

@column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
    return column.isin([3])
    # need to replace the above logic with a map function
    # like column.map(lambda x: __valid_date(x))

_spark function arguments are passed from the library

What i have,

A pyspark column with timestamp strings

What i require,

A Pyspark column with boolean(True/False) for each element based on validating the timestamp format

example for dataframe,

df.rdd.map(lambda x: __valid_date(x)).toDF()

__valid_date function returns True/False

So, i either need to convert the pyspark column into dataframe to use the above map function or is there any map function available for the pyspark column?

CodePudding user response：

Looks like you need to return a column object that the framework will use for validation. I have not used Great expectations, but maybe you can define an UDF for transforming your column. Something like this:

import pyspark.sql.functions as F
import pyspark.sql.types as T

valid_date_udf = udf(lambda x: __valid_date(x), T.BooleanType())

@column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
    return valid_date_udf(column)