I have a dataframe of reviews in which special characters appear between words with no surrounding whitespace. I want to add a space on each side of those characters.
For example:
Spark)NLP -> Spark ) NLP
Machine-Learning -> Machine - Learning
Below is my dataframe:
temp = spark.createDataFrame([
(0, "This is 5years of Spark)world 5-6"),
(1, "I wish Java-DL could use case-classes"),
(2, "Data-science is cool"),
(3, "Machine")
], ["id", "words"])
+---+-------------------------------------+
|id |words                                |
+---+-------------------------------------+
|0  |This is 5years of Spark)world 5-6    |
|1  |I wish Java-DL could use case-classes|
|2  |Data-science is cool                 |
|3  |Machine                              |
+---+-------------------------------------+
I tried the code below, but it does not work:
from pyspark.sql import functions as F
temp_1 = temp.withColumn('words', F.regexp_replace('words', r'(?<! )(?=[.,!?()\/\-\ \'])|(?<=[.,!?()\/\-\ \'])(?! )', '$1 $2 $3'))
Desired output:
+---+-----------------------------------------+
|id |words                                    |
+---+-----------------------------------------+
|0  |This is 5years of Spark ) world 5 - 6    |
|1  |I wish Java - DL could use case - classes|
|2  |Data - science is cool                   |
|3  |Machine                                  |
+---+-----------------------------------------+
Answer:
You can use the pattern
\b[^\w\s]\b|_
and replace each match with " $0 " (a space, the whole match $0, then another space).
If you do not consider an underscore to be a special char, drop the alternation and use just
\b[^\w\s]\b
which matches any char other than a word or whitespace char, but only when it sits between two word chars. Note that word chars include underscores.
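To try the substitution before running it on the dataframe, here is a minimal sketch. The helper names add_spaces and add_spaces_spark are made up for illustration. The first is a pure-Python stand-in using the re module, which writes the whole-match reference as \g<0> where Spark's regexp_replace (Java regex) writes $0. The second applies the same pattern in Spark and assumes pyspark is installed. For ASCII input like the sample rows, both engines should treat this pattern the same way.

```python
import re

# Punctuation sitting between two word characters, or an underscore anywhere.
PATTERN = r"\b[^\w\s]\b|_"


def add_spaces(text):
    """Pure-Python stand-in for the Spark call, meant for local checks only."""
    # Python spells the whole-match reference \g<0>; Spark spells it $0.
    return re.sub(PATTERN, r" \g<0> ", text)


def add_spaces_spark(df, column="words"):
    """Apply the same substitution to a Spark DataFrame column (assumes pyspark)."""
    from pyspark.sql import functions as F  # imported lazily so the check above runs without Spark
    return df.withColumn(column, F.regexp_replace(column, PATTERN, " $0 "))


samples = [
    "This is 5years of Spark)world 5-6",
    "I wish Java-DL could use case-classes",
    "Data-science is cool",
    "Machine",
]
for s in samples:
    print(add_spaces(s))
```

With the dataframe from the question, add_spaces_spark(temp).show(truncate=False) would be the Spark-side equivalent of the loop.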
If there must be a letter or digit on each side, replace the word boundaries with lookarounds:
(?<=[^\W_])[^\w\s](?=[^\W_])|_
To match special chars only between letters, use
(?<=[^\W\d_])[^\w\s](?=[^\W\d_])|_
or
(?<=\p{L})[^\w\s](?=\p{L})|_
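The difference between these variants is easiest to see on one input. The sketch below is illustrative only: the dictionary keys and the pad helper are hypothetical names, the |_ alternative is left out so the underscore stays visible, and the \p{L} form is omitted because Python's standard re module does not support that Java syntax.

```python
import re

# Hypothetical labels for the variants discussed above (without the |_ alternative).
VARIANTS = {
    "word_boundary": r"\b[^\w\s]\b",
    "letter_or_digit": r"(?<=[^\W_])[^\w\s](?=[^\W_])",
    "letters_only": r"(?<=[^\W\d_])[^\w\s](?=[^\W\d_])",
}


def pad(pattern, text):
    """Surround every match of pattern with single spaces."""
    return re.sub(pattern, r" \g<0> ", text)


text = "5-6 Data-science my_-var"
for name, pattern in VARIANTS.items():
    print(f"{name:16} {pad(pattern, text)}")
# word_boundary    5 - 6 Data - science my_ - var
# letter_or_digit  5 - 6 Data - science my_-var
# letters_only     5-6 Data - science my_-var
```

The word-boundary form treats the underscore in my_-var as a word char and so pads the hyphen after it, while the lookaround forms require a letter (or a letter or digit) on both sides.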