I have a Spark dataframe:
| id | objects |
|----|---------|
| 1  | [sun, solar system, mars, milky way] |
| 2  | [moon, cosmic rays, orion nebula] |
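For reference, here is how the example dataframe can be built (a minimal sketch, assuming `objects` is an array of strings):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Example data: `objects` is an array<string> column
df = spark.createDataFrame(
    [(1, ['sun', 'solar system', 'mars', 'milky way']),
     (2, ['moon', 'cosmic rays', 'orion nebula'])],
    ['id', 'objects'],
)
```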
I need to replace space with underscore in array elements.
Expected result:
| id | objects | concat_obj |
|----|---------|------------|
| 1  | [sun, solar system, mars, milky way] | [sun, solar_system, mars, milky_way] |
| 2  | [moon, cosmic rays, orion nebula]    | [moon, cosmic_rays, orion_nebula]    |
I tried using `regexp_replace`:

df = df.withColumn('concat_obj', regexp_replace('objects', ' ', '_'))

but that changed all spaces to underscores, while I need to replace spaces only inside the array elements.
So, how can this be done in PySpark?
CodePudding user response:
You could use the following regex:

`(?<=[A-Za-z]) `

The only difference from your code is that this pattern uses a lookbehind to match a space only when the character before it is a letter, so the `", "` separators between array elements are left untouched.
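Applied in PySpark, it could look like this (a sketch, assuming `objects` is handled as a string column, since `regexp_replace` operates on strings; if it is a true array column you would need to cast it first, which changes the result type):

```python
from pyspark.sql.functions import regexp_replace

# The lookbehind (?<=[A-Za-z]) matches a space only when the character
# before it is a letter, so the ", " separators are left untouched.
df = df.withColumn('concat_obj', regexp_replace('objects', '(?<=[A-Za-z]) ', '_'))
```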
CodePudding user response:
Use higher-order functions to replace the whitespace via `regexp_replace`:
schema
root
|-- id: long (nullable = true)
|-- objects: array (nullable = true)
| |-- element: string (containsNull = true)
solution
from pyspark.sql.functions import expr

df.withColumn('concat_obj', expr("transform(objects, x -> regexp_replace(x, ' ', '_'))")).show(truncate=False)
+---+------------------------------------+------------------------------------+
|id |objects                             |concat_obj                          |
+---+------------------------------------+------------------------------------+
|1  |[sun, solar system, mars, milky way]|[sun, solar_system, mars, milky_way]|
|2  |[moon, cosmic rays, orion nebula]   |[moon, cosmic_rays, orion_nebula]   |
+---+------------------------------------+------------------------------------+
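For what it's worth, on Spark 3.1+ the same higher-order function is also exposed directly in the Python API, so the SQL expression string can be avoided (a sketch of the equivalent call):

```python
from pyspark.sql import functions as F

# transform applies the lambda to every element of the array column;
# regexp_replace then swaps spaces for underscores within each element.
df = df.withColumn(
    'concat_obj',
    F.transform('objects', lambda x: F.regexp_replace(x, ' ', '_')),
)
```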