I have a file with 3 columns, with data in every column. The data is in the format below:
Every field is enclosed in backspace characters, like BSC123BSC (where BSC is a backspace character). Column values can contain newline and carriage return characters, and columns are delimited by the Escape character.
I am not able to find a regex pattern to replace all three of the characters mentioned above.
CodePudding user response:
A simple [\b\n\r] should work. In a Python string literal, \b, \n and \r already stand for the backspace, line feed and carriage return characters, so the character class matches exactly the three characters you want to replace:
>>> from pyspark.sql.functions import decode, unhex, regexp_replace
>>> # the first hex string decodes to \b backspace \r CR \n LF \b, the second to \b \r \n \b
>>> data = spark.createDataFrame([('086261636b73706163650d43520a4c4608',),('080d0a08',)],['hexstring']).withColumn('col',decode(unhex('hexstring'),'UTF-8')).drop('hexstring')
>>> cleansed = data.withColumn('regexed',regexp_replace('col','[\b\n\r]','*'))
>>> cleansed.select('regexed').show()
+-----------------+
|          regexed|
+-----------------+
|*backspace*CR*LF*|
|             ****|
+-----------------+
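If you also need to split each cleaned record into its three columns on the Escape delimiter, a sketch along these lines may help. The sample row, the \x1b Escape literal and the col1/col2/col3 names are assumptions made for illustration, not part of the original question:

>>> from pyspark.sql.functions import split, regexp_replace
>>> # hypothetical input: three backspace-quoted fields separated by the Escape character (\x1b)
>>> line = spark.createDataFrame([('\x08a\x08\x1b\x08b\nc\x08\x1b\x08d\x08',)], ['raw'])
>>> # drop backspace/newline/CR, then split the remaining string on the Escape delimiter
>>> parts = split(regexp_replace('raw', '[\b\n\r]', ''), '\x1b')
>>> line.select(*[parts.getItem(i).alias('col' + str(i + 1)) for i in range(3)]).show()

The expected result is three columns holding a, bc and d respectively.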