Is there a way to get the dtype of a pyspark.sql.column.Column without first calling it on a pyspark.sql.DataFrame?


This might be a niche question, but imagine that you have a udf defined like this:

import pyspark.sql.functions as sf
import pyspark.sql.types as st
from typing import List

@sf.udf(returnType=st.ArrayType(st.StringType()))
def some_function(text: str) -> List[str]:
    return text.split(' ')

This returns a udf whose returnType I need to know. Is there a way to get the return type:

  • Without calling the udf on a pyspark.sql.DataFrame and inspecting the dtypes attribute of the result
  • Without storing the returnType for this function in a separate place

Context: I want to give an .alias to the pyspark.sql.column.Column returned by the udf, where the alias depends on its type.

So in dummy code the desired result would be:

input_column_name = 'some_text_column'
expr = some_udf_function(sf.col(input_column_name))
dtype_abbreviation = get_dtype_return_type_abbreviation(expr) 
expr_renamed = expr.alias(input_column_name + '_' + dtype_abbreviation)

Where the desired return of get_dtype_return_type_abbreviation would be, for example, 'list_of_strings' for a udf that returns st.ArrayType(st.StringType()). The alias in this case would be 'some_text_column_list_of_strings'.

CodePudding user response:

You can access the returnType property of the udf:

import pyspark.sql.functions as sf
import pyspark.sql.types as st
from typing import List

@sf.udf(returnType=st.ArrayType(st.StringType()))
def some_function(text: str) -> List[str]:
    return text.split(' ')

print(some_function.returnType)

# output
# ArrayType(StringType,true)
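
Building on this, the declared returnType can be turned into the alias suffix described in the question. Below is a minimal sketch that reuses some_function from the snippet above; the helper dtype_abbreviation and its mapping table are assumptions invented for illustration, not part of the PySpark API (only UserDefinedFunction.returnType, ArrayType.elementType, and DataType.simpleString() come from PySpark itself):

import pyspark.sql.functions as sf
import pyspark.sql.types as st

# Hypothetical helper (assumption): maps a DataType to a short,
# human-readable abbreviation; extend the cases as needed.
def dtype_abbreviation(dtype: st.DataType) -> str:
    if isinstance(dtype, st.ArrayType):
        return 'list_of_' + dtype_abbreviation(dtype.elementType) + 's'
    if isinstance(dtype, st.StringType):
        return 'string'
    # Fallback: Spark's own compact notation, e.g. 'array<string>'
    return dtype.simpleString()

input_column_name = 'some_text_column'
expr = some_function(sf.col(input_column_name))

# Read the declared return type off the udf object, not off the Column
abbrev = dtype_abbreviation(some_function.returnType)
expr_renamed = expr.alias(input_column_name + '_' + abbrev)

# abbrev == 'list_of_strings', so the alias is 'some_text_column_list_of_strings'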