I found a similar question (link), but no answer there explains how to fix the issue.
I want to make a UDF that extracts words from a column. That is, I want to create a column named new_column by applying my UDF to old_column:
from pyspark.sql.functions import col, regexp_extract, udf

re_string = 'some|words|I|need|to|match'

def regex_extraction(x, re_string):
    return regexp_extract(x, re_string, 0)

extracting = udf(lambda row: regex_extraction(row, re_string))
df = df.withColumn("new_column", extracting(col('old_column')))
AttributeError: 'NoneType' object has no attribute '_jvm'
How can I fix my function? I have many columns and want to loop through a list of columns and apply my UDF to each.
CodePudding user response:
You don't need a UDF here. A UDF is only required when something cannot be done with native PySpark functions and you need pure Python functions or libraries. In your case, you can have a plain function which accepts a column (or column name) and returns a column expression; `regexp_extract` already does the work, so a UDF is not needed.
from pyspark.sql.functions import regexp_extract

df = spark.createDataFrame([('some match',)], ['old_column'])
re_string = 'some|words|I|need|to|match'

def regex_extraction(x, re_string):
    return regexp_extract(x, re_string, 0)

df = df.withColumn("new_column", regex_extraction('old_column', re_string))
df.show()
# +----------+----------+
# |old_column|new_column|
# +----------+----------+
# |some match|      some|
# +----------+----------+
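For reference, the original error occurs because `regexp_extract` builds a JVM column expression; inside a `udf` it executes on the workers, where no SparkSession is active, hence `'NoneType' object has no attribute '_jvm'`. If you ever do need a real UDF, the function wrapped by it must use plain Python, e.g. the `re` module. A minimal sketch of the equivalent logic (`regex_extraction_py` is a hypothetical name):

```python
import re

def regex_extraction_py(value, pattern):
    # Pure-Python equivalent of regexp_extract(value, pattern, 0):
    # return the whole text of the first match, or '' when there is
    # no match (mirroring Spark's empty-string behaviour).
    m = re.search(pattern, value)
    return m.group(0) if m else ''

re_string = 'some|words|I|need|to|match'
print(regex_extraction_py('some match', re_string))  # -> some
print(regex_extraction_py('nothing', re_string))     # -> '' (no match)
```

A function like this could then be wrapped with `udf(regex_extraction_py_partial)` and applied per column, but for regex extraction the native `regexp_extract` shown above is faster because it avoids Python serialization.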
"Looping" through a list of columns can be implemented this way:
from pyspark.sql.functions import regexp_extract

cols = ['col1', 'col2']
df = spark.createDataFrame([('some match', 'match')], cols)
re_string = 'some|words|I|need|to|match'

def regex_extraction(x, re_string):
    return regexp_extract(x, re_string, 0)

df = df.select(
    '*',
    *[regex_extraction(c, re_string).alias(f'new_{c}') for c in cols]
)
df.show()
# +----------+-----+--------+--------+
# |      col1| col2|new_col1|new_col2|
# +----------+-----+--------+--------+
# |some match|match|    some|   match|
# +----------+-----+--------+--------+