Replace spaces with underscores inside array elements in PySpark


I have a Spark dataframe:

id objects
1 [sun, solar system, mars, milky way]
2 [moon, cosmic rays, orion nebula]

I need to replace space with underscore in array elements.

Expected result:

id objects concat_obj
1 [sun, solar system, mars, milky way] [sun, solar_system, mars, milky_way]
2 [moon, cosmic rays, orion nebula] [moon, cosmic_rays, orion_nebula]

I tried using regexp_replace:

df = df.withColumn('concat_obj', regexp_replace('objects', ' ', '_'))

but that changed all spaces to underscores while I need to replace spaces only inside array elements.
So, how can this be done in PySpark?

CodePudding user response:

You could use the following regex:

`(?<=[A-Za-z]) `

The only difference with respect to your code is that this pattern matches a space only when it is preceded by an alphabetical character, so the `, ` separators between elements are left untouched while the spaces inside elements like `solar system` are replaced.

CodePudding user response:

Use higher-order functions to replace the whitespace inside each element with regexp_replace:

schema

root
 |-- id: long (nullable = true)
 |-- objects: array (nullable = true)
 |    |-- element: string (containsNull = true)

solution

from pyspark.sql.functions import expr

df.withColumn('concat_obj', expr("transform(objects, x -> regexp_replace(x, ' ', '_'))")).show(truncate=False)

+---+------------------------------------+------------------------------------+
|id |objects                             |concat_obj                          |
+---+------------------------------------+------------------------------------+
|1  |[sun, solar system, mars, milky way]|[sun, solar_system, mars, milky_way]|
|2  |[moon, cosmic rays, orion nebula]   |[moon, cosmic_rays, orion_nebula]   |
+---+------------------------------------+------------------------------------+