(1) 00781 should be -> 781
(2) 00001 should be -> 1
(3) 00000 should be -> 0 (just one 0 for this special case)
(4) 10101 should be -> 10101 (unchanged)
What I have so far is
from pyspark.sql import functions as F
df = df.withColumn('ID', F.regexp_replace('ID', r'^[0]*', ''))
But this is not entirely right: it strips every zero from 00000, leaving an empty string instead of 0.
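For reference, a minimal tweak that would make this regexp_replace approach handle the all-zeros case (a sketch, relying on the lookahead support in the Java regex engine Spark uses):
from pyspark.sql import functions as F

# Strip leading zeros only while at least one character would remain:
# the negative lookahead (?!$) stops the match one zero short of the
# end of the string, so '00000' becomes '0' instead of ''.
df = df.withColumn('ID', F.regexp_replace('ID', r'^0+(?!$)', ''))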
CodePudding user response:
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([('01001',), ('00056',), ('00000',), ('10101',)], ['id'])

# Casting the string to an integer drops the leading zeros,
# and '00000' naturally becomes 0.
df.withColumn('new_id', col('id').cast(IntegerType())).show()
+-----+------+
|   id|new_id|
+-----+------+
|01001|  1001|
|00056|    56|
|00000|     0|
|10101| 10101|
+-----+------+
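If new_id needs to stay a string (the question's IDs are strings), one option is to round-trip through the integer cast; a sketch:
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, StringType

# Cast to int to drop the leading zeros, then back to string
# so the column keeps its original type.
df.withColumn('new_id', col('id').cast(IntegerType()).cast(StringType())).show()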
CodePudding user response:
I find it easier to use regexp_extract here, although there's almost certainly a way to do it with regexp_replace.
from pyspark.sql import functions as f

df.withColumn('id2', f.regexp_extract(f.col('ID'), '^0*(([^0].*)|0$)', 1)).show()
+-----+-----+
|   ID|  id2|
+-----+-----+
|00781|  781|
|00001|    1|
|00000|    0|
|10101|10101|
+-----+-----+
That regex simply targets either the substring starting at the first non-0 character, or the last 0 (if there are only 0s).
The result of that is a string column, so cast it afterwards if you need a numeric type.
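To sanity-check the pattern outside Spark, Python's re engine handles this particular pattern the same way as the Java engine Spark uses; a quick sketch:
import re

pattern = re.compile(r'^0*(([^0].*)|0$)')
for s in ['00781', '00001', '00000', '10101']:
    # Group 1 is either everything from the first non-zero, or the final '0'.
    print(s, '->', pattern.match(s).group(1))
# 00781 -> 781, 00001 -> 1, 00000 -> 0, 10101 -> 10101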