Split a column of a PySpark DataFrame based on the '%' symbol

Time:09-16

I am trying to split the data in a column on the '%' symbol, but some of the rows do not contain a '%' symbol.

Input data

+-------------------+
|Default_value      |
+-------------------+
|       10% OF VALUE|
|       20% OF VALUE|
| This is null VALUE|
|     0 is the value|
+-------------------+

Expected output

+-------------------+-------------------+
|value              | Description       |
+-------------------+-------------------+
|                10%|   OF VALUE        |
|                20%|   OF VALUE        |
|                   |This is null VALUE |
|                   | 0 is the value    |
+-------------------+-------------------+

I tried a regex on '%', but the rows that do not contain '%' end up in the 'value' column, and I want them in the 'Description' column.

CodePudding user response:

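You can use the regexp_extract function. It returns an empty string when the pattern or the requested group does not match, so rows without '%' get an empty 'value'.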

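from pyspark.sql.functions import regexp_extract
from pyspark.sql.types import StringType

# Assumes an active SparkSession named `spark`.
df = spark.createDataFrame(['10% OF VALUE', '20% OF VALUE', 'This is null VALUE', '0 is the value'], StringType()) \
         .toDF('Default_value')

# 'value': everything up to the last '%'. 'Description': the text after it, or the whole string when there is no '%'.
df.withColumn('value', regexp_extract('Default_value', '.*%', 0)) \
  .withColumn('Description', regexp_extract('Default_value', '(.*%|.{0})(.*)', 2)).show()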

+------------------+-----+------------------+
|     Default_value|value|       Description|
+------------------+-----+------------------+
|      10% OF VALUE|  10%|          OF VALUE|
|      20% OF VALUE|  20%|          OF VALUE|
|This is null VALUE|     |This is null VALUE|
|    0 is the value|     |    0 is the value|
+------------------+-----+------------------+
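
To check how the second pattern behaves without starting a Spark session, the same regex can be tried in plain Python with the standard `re` module. This is only a sketch: Spark evaluates the pattern with Java's regex engine, but for a pattern this simple the two should agree. The helper name `split_on_percent` is made up for illustration and is not part of PySpark.

```python
import re

# Group 1: greedy prefix ending in '%', or the empty string when there is no '%'.
# Group 2: whatever remains after group 1.
PATTERN = re.compile(r'(.*%|.{0})(.*)')

def split_on_percent(text):
    """Hypothetical helper: return (value, description) for one input string."""
    match = PATTERN.match(text)
    return match.group(1), match.group(2)

rows = ['10% OF VALUE', '20% OF VALUE', 'This is null VALUE', '0 is the value']
for row in rows:
    print(split_on_percent(row))
```

For '10% OF VALUE' the description comes back as ' OF VALUE', with the leading space kept. If that space is unwanted, wrapping the 'Description' expression in `trim` from `pyspark.sql.functions` is one way to remove it in the Spark version.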