How to add a space between consecutive letters and numbers in PySpark dataframe?

Time:07-25

I have a dataframe with text columns. Some words contain numbers that are directly attached to letters, either before or after them. I want to separate the numbers from the letters by adding a space between them.

For example:

Machine1234 -> Machine 1234
5years -> 5 years

Below is my dataframe

+---+--------------------------------------------+
|id |words                                       |
+---+--------------------------------------------+
|0  |This is Spark123 of 5years                  |
|1  |I wish Java DL1234 could use case classes444|
|2  |Data science is  cool321                    |
|3  |Machine345                                  |
+---+--------------------------------------------+

Below is the code I tried, but it does not work:

df2 = temp.select('id',
    F.regexp_replace('words', r'(\d+(\.\d+)?)', ' \1').alias('words'))

Desired output:

+---+----------------------------------------------+
|id |words                                         |
+---+----------------------------------------------+
|0  |This is Spark 123 of 5 years                  |
|1  |I wish Java DL 1234 could use case classes 444|
|2  |Data science is  cool 321                     |
|3  |Machine 345                                   |
+---+----------------------------------------------+

CodePudding user response:

F.regexp_replace('words', r'(\p{L}+)(\d+)', '$1 $2')

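\p{L} matches any Unicode letter and \d matches a digit, so the pattern captures a run of letters immediately followed by a run of digits. In the replacement, $1 and $2 refer to the two captured groups. Spark uses Java regex syntax, so group references are written $1, not \1.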

Full example:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(0, 'This is Spark123'),
     (1, 'I wish Java DL1234 could use case classes444'),
     (2, 'Data science is  cool321'),
     (3, 'Machine345')],
    ['id', 'words'])

df = df.withColumn('words', F.regexp_replace('words', r'(\p{L}+)(\d+)', '$1 $2'))

df.show(truncate=0)
# +---+----------------------------------------------+
# |id |words                                         |
# +---+----------------------------------------------+
# |0  |This is Spark 123                             |
# |1  |I wish Java DL 1234 could use case classes 444|
# |2  |Data science is  cool 321                     |
# |3  |Machine 345                                   |
# +---+----------------------------------------------+
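
Note that the pattern above only covers letters followed by digits. The question also asks for the reverse case (5years -> 5 years). One possible way to cover both directions is a zero-width lookaround pattern that matches the position between a letter and a digit. Below is a minimal sketch that checks the idea locally with Python's re module, without Spark. The names BOUNDARY, SPARK_PATTERN and add_spaces are made up for this illustration, and [^\W\d_] is only an approximation of \p{L}, used because the standard re module does not support \p{L}.

```python
import re

# Zero-width pattern: matches the empty position between a letter and a digit,
# in either order. [^\W\d_] is a common Python approximation of "any letter";
# it stands in for \p{L}, which the standard re module does not support.
BOUNDARY = re.compile(r'(?<=[^\W\d_])(?=\d)|(?<=\d)(?=[^\W\d_])')

# Assumed Java-regex equivalent that could be passed to F.regexp_replace.
# It is only defined here, not executed against Spark.
SPARK_PATTERN = r'(?<=\p{L})(?=\d)|(?<=\d)(?=\p{L})'


def add_spaces(text):
    """Insert one space at every letter/digit boundary in text."""
    return BOUNDARY.sub(' ', text)


print(add_spaces('This is Spark123 of 5years'))
# This is Spark 123 of 5 years
```

In Spark the same idea would presumably be written as F.regexp_replace('words', SPARK_PATTERN, ' '). Because the match is zero-width, the replacement is just a single space and no group references are needed.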