I have a Spark dataframe with approximately 100 columns. NULL instances are currently recorded as \N. I want to replace all instances of \N with NULL; however, because the backslash is an escape character, I'm having difficulty. I've found an article that uses regex for a single column, but I need to iterate over all columns.
I've even tried the solution from the article on a single column, but I still cannot get it to work. Ordinarily I'd use R, and I am able to solve this issue in R using the following code:
df <- sapply(df,function(x) {x <- gsub("\\\\N",NA,x)})
However, given that I'm new to PySpark, I'm having quite a lot of difficulty.
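For context, a single-column regex replacement along the lines of that article would look roughly like this in PySpark (the column name 'col2' is just a placeholder here, and note that regexp_replace can only substitute a string such as '' rather than a true NULL):

import pyspark.sql.functions as psf

# Hypothetical single-column version: the backslash must be escaped in the
# regex pattern, and the replacement is a string, not a real NULL.
df = df.withColumn('col2', psf.regexp_replace('col2', r'\\N', ''))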
CodePudding user response:
Will this work for you?
import pyspark.sql.functions as psf
data = [ ('0','\\N','\\N','3')]
df = spark.createDataFrame(data, ['col1','col2','col3','col4'])
print('before:')
df.show()
# Loop over every column, replacing the literal \N string with a true NULL
for col in df.columns:
    df = df.withColumn(col, psf.when(psf.col(col) == psf.lit('\\N'), psf.lit(None)).otherwise(psf.col(col)))
print('after:')
df.show()
before:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   0|  \N|  \N|   3|
+----+----+----+----+
after:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   0|null|null|   3|
+----+----+----+----+
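As a side note, calling withColumn once per column builds up a long chain of projections; with ~100 columns you can do the same replacement in a single select. A minimal sketch, assuming the same psf import as above:

import pyspark.sql.functions as psf

# Replace the literal \N with NULL in every column in one pass
df = df.select([
    psf.when(psf.col(c) == '\\N', None).otherwise(psf.col(c)).alias(c)
    for c in df.columns
])

# Depending on your Spark version, an exact-match replace across all
# columns may also work:
# df = df.replace('\\N', None)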