Replace Newline character, Backspace character and carriage return character in pyspark dataframe

Time:07-19

I have a file with three columns, and every column has data in every row. The data is in the following format:

Every field is enclosed in backspace characters, like BSC123BSC (where BSC stands for a backspace character). Column values contain newline and carriage return characters. Columns are delimited by the escape character.
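
To make the layout concrete, here is a minimal sketch in plain Python that builds one record in the format described. The three field values are invented for illustration and are not taken from the actual file.

```python
BS = "\x08"   # backspace character, encloses every field
ESC = "\x1b"  # escape character, separates the columns

# Hypothetical values for the three columns; two of them contain line breaks.
fields = ["123", "line one\nline two", "abc\r\ndef"]

# Wrap each value in backspaces, then join the columns with escape.
record = ESC.join(BS + value + BS for value in fields)
print(repr(record))
```

Printing with repr() makes the otherwise invisible control characters show up as escape sequences.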

I cannot find a regex pattern that replaces all three of these characters.

CodePudding user response:

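A simple character class, [\b\n\r], should work. Because the pattern is written as an ordinary (non-raw) Python string, Python converts \b, \n and \r into the actual backspace, newline and carriage return characters before the pattern reaches Spark.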

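>>> from pyspark.sql.functions import decode, regexp_replace, unhex
>>> # Hex-encoded test rows: 08 = backspace, 0d = carriage return, 0a = newline
>>> data = spark.createDataFrame([('086261636b73706163650d43520a4c4608',),('080d0a08',)],['hexstring']).withColumn('col',decode(unhex('hexstring'),'UTF-8')).drop('hexstring')
>>> # Replace every backspace, newline and carriage return with '*'
>>> cleansed = data.withColumn('regexed',regexp_replace('col','[\b\n\r]','*'))
>>> cleansed.select('regexed').show()
+-----------------+
|          regexed|
+-----------------+
|*backspace*CR*LF*|
|             ****|
+-----------------+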
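
The character class can also be checked locally without a Spark session. The sketch below uses Python's re module, which is not the Java regex engine that Spark uses, so it only confirms what the pattern string contains and how such a class behaves. The explicit \x08 spelling is an alternative added here, not part of the original answer; it is assumed to carry over to regexp_replace, but verify that on your Spark version.

```python
import re

# Non-raw string: Python turns \b, \n and \r into the control characters
# themselves, so the class holds three literal characters.
pattern = '[\b\n\r]'
assert [ord(c) for c in pattern[1:-1]] == [8, 10, 13]

# Raw string: the escapes are left for the regex engine to resolve.
# \x08 names the backspace character by its hex code.
explicit = r'[\x08\n\r]'

# Same content as the first hex-encoded test row above.
sample = '\x08backspace\rCR\nLF\x08'

print(re.sub(pattern, '*', sample))    # *backspace*CR*LF*
print(re.sub(explicit, '*', sample))   # *backspace*CR*LF*
```

Spelling the backspace as \x08 avoids any confusion with \b, which means a word boundary in most regex dialects when used outside a character class.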
