I have a DataFrame with text columns. Some words have numbers attached to them, either after the letters or before them. I want to separate the numbers from the letters by adding a space between them.
For example:
Machine1234 -> Machine 1234
5years -> 5 years
Below is my DataFrame:
+---+--------------------------------------------+
|id |words                                       |
+---+--------------------------------------------+
|0  |This is Spark123 of 5years                  |
|1  |I wish Java DL1234 could use case classes444|
|2  |Data science is cool321                     |
|3  |Machine345                                  |
+---+--------------------------------------------+
Below is the code I tried, but it does not work:
df2 = temp.select('id',
    F.regexp_replace('words', r'(\d+(\.\d+)?)', ' \1').alias('words'))
Desired output:
+---+----------------------------------------------+
|id |words                                         |
+---+----------------------------------------------+
|0  |This is Spark 123 of 5 years                  |
|1  |I wish Java DL 1234 could use case classes 444|
|2  |Data science is cool 321                      |
|3  |Machine 345                                   |
+---+----------------------------------------------+
Answer:
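F.regexp_replace('words', r'(\p{L}+)(\d+)', '$1 $2')
\p{L} matches only letters, so the pattern captures letters followed by digits and puts a space between the two groups. Spark's regexp_replace uses Java regex syntax, so groups are referenced in the replacement as $1 and $2, not \1.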
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(0, 'This is Spark123'),
     (1, 'I wish Java DL1234 could use case classes444'),
     (2, 'Data science is cool321'),
     (3, 'Machine345')],
    ['id', 'words'])
df = df.withColumn('words', F.regexp_replace('words', r'(\p{L}+)(\d+)', '$1 $2'))
df.show(truncate=0)
# +---+----------------------------------------------+
# |id |words                                         |
# +---+----------------------------------------------+
# |0  |This is Spark 123                             |
# |1  |I wish Java DL 1234 could use case classes 444|
# |2  |Data science is cool 321                      |
# |3  |Machine 345                                   |
# +---+----------------------------------------------+
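The pattern above only covers letters followed by digits, so a value like 5years from the question would be left unchanged. One possible way to cover both directions is to match the zero-width boundary between a letter and a digit with lookarounds and replace it with a space. The sketch below is only an illustration: split_letters_digits and SPARK_PATTERN are hypothetical names, and it uses Python's re module so it can be run without a Spark session. Python's re has no \p{L}, so [^\W\d_] stands in for it. Java regex supports both lookarounds and \p{L}, so the Spark form in the comment should behave the same way, but it has not been run here.

```python
import re

# Zero-width boundary between a letter and a digit, in either order.
# Python's re has no \p{L}; [^\W\d_] is a common stand-in for "any letter".
_BOUNDARY = re.compile(r"(?<=[^\W\d_])(?=\d)|(?<=\d)(?=[^\W\d_])")

# Assumed Java-regex form of the same idea for Spark (not run here):
#   F.regexp_replace('words', SPARK_PATTERN, ' ')
SPARK_PATTERN = r"(?<=\p{L})(?=\d)|(?<=\d)(?=\p{L})"


def split_letters_digits(text: str) -> str:
    """Insert a space wherever a letter directly touches a digit."""
    return _BOUNDARY.sub(" ", text)


if __name__ == "__main__":
    for s in ["This is Spark123 of 5years", "DL1234", "Machine345"]:
        print(split_letters_digits(s))
```

Because the match is zero-width, nothing is consumed and no capture groups are needed in the replacement, which avoids the $1 versus \1 question entirely.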