Notice: this is for Spark
version 2.1.1.2.6.1.0-129
I have a spark dataframe. One of the columns has states as type string (ex. Illinois, California, Nevada). There are some instances of numbers in this column (ex. 12, 24, 01, 2). I would like to replace any instace of an integer with a NULL
.
The following is some code that I have written:
my_df = my_df.selectExpr(
" regexp_replace(states, '^-?[0-9] $', '') AS states ",
"someOtherColumn")
This regex expression replaces any instance of an integer with an empty string. I would like to replace it with None in python to designate it as a NULL
value in the DataFrame.
CodePudding user response:
I strongly suggest you to look at PySpark SQL functions, and try to use them properly instead of selectExpr
from pyspark.sql import functions as F
(df
.withColumn('states', F
.when(F.regexp_replace(F.col('states'), '^-?[0-9] $', '') == '', None)
.otherwise(F.col('states'))
)
.show()
)
# Output
# ---------- ------------
# | states|states_fixed|
# ---------- ------------
# | Illinois| Illinois|
# | 12| null|
# |California| California|
# | 01| null|
# | Nevada| Nevada|
# ---------- ------------