PySpark remove leading 0s from data frame column unless the value is just 0


I want to strip leading zeros from a string ID column. For example:

(1) 00781 should be -> 781

(2) 00001 should be -> 1

(3) 00000 should be -> 0 (just one 0 for this special case)

(4) 10101 should be -> 10101 (unchanged)

What I have so far is:

from pyspark.sql import functions as F
df = df.withColumn('ID',F.regexp_replace('ID', r'^[0]*',''))

But this is not entirely right, as it does not work for the 00000 -> 0 case: stripping every leading zero there leaves an empty string instead of 0.
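For illustration, a minimal reproduction of the failing case (assuming an active SparkSession named spark):

from pyspark.sql import functions as F

demo = spark.createDataFrame([('00000',)], ['ID'])
demo.withColumn('ID', F.regexp_replace('ID', r'^[0]*', '')).collect()
# [Row(ID='')]  <- every zero is stripped, leaving an empty string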

CodePudding user response:

Casting the string column to an integer drops the leading zeros automatically:

from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([('01001',), ('00056',), ('00000',), ('10101',)], ['id'])
# Casting to IntegerType parses the string as a number, discarding leading zeros
df.withColumn('new_id', col('id').cast(IntegerType())).show()

+-----+------+
|   id|new_id|
+-----+------+
|01001|  1001|
|00056|    56|
|00000|     0|
|10101| 10101|
+-----+------+
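One caveat the answer does not mention: the cast changes the column type, and any value that does not parse as an integer becomes null. If the IDs must stay strings, a possible follow-up (a sketch, reusing the df above) is to cast back:

from pyspark.sql.functions import col

# Integer cast drops the zeros; string cast restores the original column type
df.withColumn('new_id', col('id').cast('int').cast('string')).show()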

CodePudding user response:

I find it easier to use regexp_extract here, although there is almost certainly a way to do it with regexp_replace.

from pyspark.sql import functions as f

df.withColumn('id2', f.regexp_extract(f.col('ID'), '^0*(([^0].*)|0$)', 1)).show()

+-----+-----+
|   ID|  id2|
+-----+-----+
|00781|  781|
|00001|    1|
|00000|    0|
|10101|10101|
+-----+-----+

That regex captures either the substring starting at the first non-zero character, or the final zero when the string is all zeros, and extracts it as group 1.
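As for the regexp_replace route this answer alludes to, one possible sketch uses a negative lookahead so the final zero of an all-zero string is never removed (Spark's regexp functions use Java regex, which supports lookaheads; this is a hedged variant, not the answerer's code):

from pyspark.sql import functions as f

# Strip leading zeros, but (?!$) refuses to consume a zero that is
# the last character, so '00000' -> '0' rather than ''
df.withColumn('id2', f.regexp_replace('ID', r'^0+(?!$)', '')).show()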

