Replace all strings with escape characters across all columns with NULLs in PySpark


I have a Spark dataframe with approximately 100 columns. NULL instances are currently recorded as \N. I want to replace all instances of \N with NULL; however, because the backslash is an escape character, I'm having difficulty. I've found this article that uses regex for a single column, but I need to iterate over all columns.

I've even tried the solution in the article on a single column; however, I still cannot get it to work. Ordinarily, I'd use R, and I am able to solve this issue in R using the following code:

df <- sapply(df,function(x) {x <- gsub("\\\\N",NA,x)})

However, given that I'm new to PySpark, I'm having quite a lot of difficulty.

CodePudding user response:

Will this work for you?

import pyspark.sql.functions as psf

# Sample data: '\\N' in Python source is the two-character literal string \N
data = [('0', '\\N', '\\N', '3')]
df = spark.createDataFrame(data, ['col1', 'col2', 'col3', 'col4'])

print('before:')
df.show()

# For each column, replace the literal string \N with a true NULL,
# leaving all other values unchanged
for col in df.columns:
    df = df.withColumn(
        col,
        psf.when(psf.col(col) == psf.lit('\\N'), psf.lit(None)).otherwise(psf.col(col))
    )

print('after:')
df.show()

before:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   0|  \N|  \N|   3|
+----+----+----+----+

after:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   0|null|null|   3|
+----+----+----+----+
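
If you would rather build the whole projection at once instead of chaining withColumn roughly a hundred times (each withColumn call adds another projection to the query plan), the same logic fits in a single select. This is a minimal sketch assuming the same df and column names as above:

import pyspark.sql.functions as psf

# One select that maps the literal string \N to NULL in every column;
# the alias keeps each original column name.
df = df.select([
    psf.when(psf.col(c) == psf.lit('\\N'), psf.lit(None)).otherwise(psf.col(c)).alias(c)
    for c in df.columns
])

Depending on your Spark version, df.na.replace('\\N', None) may also achieve the same thing for string columns without writing the when/otherwise by hand.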