I want to use regexp_replace()
in PySpark to convert all question marks and back slashes in my data frame to null values. This is the code I used:
question = "?"
empty_str = "\\\"\\\""
for column in df.columns:
df = df.withColumn(column, regexp_replace(column, question, None)
df = df.withColumn(column, regexp_replace(column, empty_str, None)
However, when I use this code all the values in my dataframe turn into null values - not just the question marks and back slashes. Is there a way I can change my code to fix this?
CodePudding user response:
With regexp_replace
you cannot replace values to null, you will need another method, e.g. replace
from pyspark.sql import functions as F
df = spark.createDataFrame([("?",), ("\\",), ("b",)], ["col_name"])
df.show()
# --------
# |col_name|
# --------
# | ?|
# | \|
# | b|
# --------
pattern = r"^[?\\] $"
df = df.withColumn("col_name", F.regexp_replace("col_name", pattern, "")) \
.replace("", None, "col_name")
df.show()
# --------
# |col_name|
# --------
# | null|
# | null|
# | b|
# --------
In your attempt, every value changed to null, because you incorrectly provided None to the replacement
argument, instead of str
. Only str is accepted, according to the documentation.
pyspark.sql.functions.regexp_replace(str: ColumnOrName, pattern: str, replacement: str) → pyspark.sql.column.Column
CodePudding user response:
it is working like this, ou have to use \\ to replace backslash with null and ? to replace Question mark with null
>>> df.show(truncate=False)
-------------------------------------------------
|_c0 |
-------------------------------------------------
|"{""id"\":""e5?2f247c-f46c-4021-bc62-e28e56db1ad8|
-------------------------------------------------
>>> df.withColumn("_c0",regexp_replace('_c0','\\\\','')).withColumn("_c0",regexp_replace('_c0','\?','')).show(truncate=False)
-----------------------------------------------
|_c0 |
-----------------------------------------------
|"{""id"":""e52f247c-f46c-4021-bc62-e28e56db1ad8|
-----------------------------------------------