How to add a space between the words when there is a special character in pyspark dataframe using re

Time:07-25

I have a DataFrame of reviews in which special characters appear between words with no surrounding spaces. I want to add a space on each side of such a character.

For example,

Spark)NLP -> Spark ) NLP
Machine-Learning -> Machine - Learning

Below is my DataFrame:

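from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # F is used in the regexp_replace call further down

spark = SparkSession.builder.getOrCreate()

temp = spark.createDataFrame([
    (0, "This is 5years of Spark)world 5-6"),
    (1, "I wish Java-DL could use case-classes"),
    (2, "Data-science is  cool"),
    (3, "Machine")
], ["id", "words"])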


+---+-------------------------------------+
|id |words                                |
+---+-------------------------------------+
|0  |This is 5years of Spark)world 5-6    |
|1  |I wish Java-DL could use case-classes|
|2  |Data-science is  cool                |
|3  |Machine                              |
+---+-------------------------------------+

I used the code below, but it does not work:

temp_1 = temp.withColumn('words', F.regexp_replace('words', r'(?<! )(?=[.,!?()\/\-\ \'])|(?<=[.,!?()\/\-\ \'])(?! )', '$1 $2 $3'))

Desired output:

+---+-----------------------------------------+
|id |words                                    |
+---+-----------------------------------------+
|0  |This is 5years of Spark ) world 5 - 6    |
|1  |I wish Java - DL could use case - classes|
|2  |Data - science is  cool                  |
|3  |Machine                                  |
+---+-----------------------------------------+

CodePudding user response:

You can use

\b[^\w\s]\b|_

and replace with " $0 ", that is, the whole match ($0) with a space on each side.

If you do not consider an underscore to be a special char, just use \b[^\w\s]\b that matches any char other than word and whitespace chars between word chars. Note word chars include underscores.
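As a quick sanity check, the boundary-based pattern can be tried on the question's sample rows with Python's built-in re module before running it in Spark. This is only a sketch: add_spaces is a hypothetical helper name, and Python's re writes the whole-match reference as \g<0>, whereas Spark's regexp_replace follows Java regex syntax and uses $0.

```python
import re

# Pattern from the answer: a non-word, non-space char sitting between
# two word chars, or an underscore.
PATTERN = r"\b[^\w\s]\b|_"

def add_spaces(text):
    # Hypothetical helper for illustration only.
    # \g<0> is Python's spelling of the whole match; Spark/Java uses $0.
    return re.sub(PATTERN, r" \g<0> ", text)

rows = [
    "This is 5years of Spark)world 5-6",
    "I wish Java-DL could use case-classes",
    "Data-science is  cool",
    "Machine",
]

for row in rows:
    print(add_spaces(row))
# This is 5years of Spark ) world 5 - 6
# I wish Java - DL could use case - classes
# Data - science is  cool
# Machine
```

In PySpark, the corresponding call would presumably be temp.withColumn('words', F.regexp_replace('words', r'\b[^\w\s]\b|_', ' $0 ')), with F being pyspark.sql.functions.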

If there must be letters or digits on each side, replace word boundaries with lookarounds: (?<=[^\W_])[^\w\s](?=[^\W_])|_. To only find special chars between letters: (?<=[^\W\d_])[^\w\s](?=[^\W\d_])|_ or (?<=\p{L})[^\w\s](?=\p{L})|_.
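The difference between the lookaround variants shows up on a string that mixes letters and digits. The sketch below uses Python's re for illustration; pad_matches and the constant names are hypothetical. The \p{L} form is left out because Python's built-in re does not support \p{...} classes, although Java regex (and so Spark) does.

```python
import re

# Patterns from the answer; only the lookaround character classes differ.
ALNUM_ONLY = r"(?<=[^\W_])[^\w\s](?=[^\W_])|_"        # letter or digit on each side
LETTERS_ONLY = r"(?<=[^\W\d_])[^\w\s](?=[^\W\d_])|_"  # letter on each side

def pad_matches(pattern, text):
    # Hypothetical helper: surround every match with single spaces.
    # \g<0> is Python's whole-match reference (Spark/Java: $0).
    return re.sub(pattern, r" \g<0> ", text)

sample = "Spark)world 5-6"

print(pad_matches(ALNUM_ONLY, sample))    # Spark ) world 5 - 6
print(pad_matches(LETTERS_ONLY, sample))  # Spark ) world 5-6
```

With the letters-only variant, the hyphen in 5-6 is left alone because neither neighbour is a letter.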
