I have a Spark dataframe:
| id | objects |
|----|---------|
| 1  | [sun, solar system, mars, milky way] |
| 2  | [moon, cosmic rays, orion nebula] |
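For reference, here is how the example dataframe can be built (a minimal sketch, assuming `objects` is an array of strings):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Example data: `objects` is an array<string> column
df = spark.createDataFrame(
    [(1, ['sun', 'solar system', 'mars', 'milky way']),
     (2, ['moon', 'cosmic rays', 'orion nebula'])],
    ['id', 'objects'],
)
```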
I need to replace space with underscore in array elements.
Expected result:
| id | objects | concat_obj |
|----|---------|------------|
| 1  | [sun, solar system, mars, milky way] | [sun, solar_system, mars, milky_way] |
| 2  | [moon, cosmic rays, orion nebula]    | [moon, cosmic_rays, orion_nebula]    |
I tried using `regexp_replace`:

df = df.withColumn('concat_obj', regexp_replace('objects', ' ', '_'))

but that changed all spaces to underscores, while I need to replace spaces only inside the array elements.
So, how can this be done in PySpark?
CodePudding user response:
You could use the following regex:

`(?<=[A-Za-z]) `

The only difference from your code is that this pattern uses a lookbehind to match a space only when the character before it is a letter, so the `", "` separators between array elements are left untouched.
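Applied in PySpark, it could look like this (a sketch, assuming `objects` is handled as a string column, since `regexp_replace` operates on strings; if it is a true array column you would need to cast it first, which changes the result type):

```python
from pyspark.sql.functions import regexp_replace

# The lookbehind (?<=[A-Za-z]) matches a space only when the character
# before it is a letter, so the ", " separators are left untouched.
df = df.withColumn('concat_obj', regexp_replace('objects', '(?<=[A-Za-z]) ', '_'))
```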
CodePudding user response:
Use higher-order functions to replace the whitespace via `regexp_replace`:
schema
root
|-- id: long (nullable = true)
|-- objects: array (nullable = true)
| |-- element: string (containsNull = true)
solution
from pyspark.sql.functions import expr

df.withColumn('concat_obj', expr("transform(objects, x -> regexp_replace(x, ' ', '_'))")).show(truncate=False)
+---+------------------------------------+------------------------------------+
|id |objects                             |concat_obj                          |
+---+------------------------------------+------------------------------------+
|1  |[sun, solar system, mars, milky way]|[sun, solar_system, mars, milky_way]|
|2  |[moon, cosmic rays, orion nebula]   |[moon, cosmic_rays, orion_nebula]   |
+---+------------------------------------+------------------------------------+
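For what it's worth, on Spark 3.1+ the same higher-order function is also exposed directly in the Python API, so the SQL expression string can be avoided (a sketch of the equivalent call):

```python
from pyspark.sql import functions as F

# transform applies the lambda to every element of the array column;
# regexp_replace then swaps spaces for underscores within each element.
df = df.withColumn(
    'concat_obj',
    F.transform('objects', lambda x: F.regexp_replace(x, ' ', '_')),
)
```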