Guys I have the input string 00000000995510.32 and I need to remove the sign and the leading zeros, my output number should be: 995510.32.
Is there a regular expression to do this in regexp_replace
?
My current code:
df.withColumn("vl_fat",regexp_replace(col("vl_fat"),"^([0-9]|[1-9][0-9])$ ", ""))
but that didn't work
CodePudding user response:
if you want to practise regex, try: https://regex101.com/. The pattern you describe is that it starts with one
and then with a zero to many amount of 0
, which in python regex would be [ ][0]*
. You also need to consider the look ahead feature of regex that can get a little weird. This should work however:
(?![ ])(?![0]).*
CodePudding user response:
you can use this regex "\ 0 "
to catch the leading 000...
explanation from regex101 :
\
matches the character0
matches the character0
literally (case sensitive)
CodePudding user response:
My two cents: you can use regex_extract
(that seems to suite better your use case) and convert the input string into a float:
from pyspark.sql import functions as F, types as T
df = spark.createDataFrame(
[(' 00000000995510.32',),
('34.32',),
(' 00000.34',),
(' 0444444',),
('9.',)],
T.StructType([
T.StructField('input_string', T.StringType())
])
)
df.withColumn('parsed_float',
F.regexp_extract('input_string', '^(\ 0 |)(\d (\.\d*|))$', 2).cast(T.FloatType()))
This is what you get:
------------------ ------------
| input_string|parsed_float|
------------------ ------------
| 00000000995510.32| 995510.3|
| 34.32| 34.32|
| 00000.34| 0.34|
| 0444444| 444444.0|
| 9.| 9.0|
------------------ ------------
For the regex:
(\ 0 |)
: this captures the initial (optional)0
(\d (\.\d*|))
: this captures the whole figure, described as a sequence of numbers followed by a (optional) sequence composed of a.
followed by any number of decimals
The second argument of regex_extract
is the group you are interested into; in this case is the second one, i.e., (\d (\.\d*|))
.
CodePudding user response:
Instead of regex you might like to use TRIM
. I find this easier to read and it better conveys the intention of the code. Note this code will also remove any
signs directly after your leading zeros.
import pyspark.sql.functions as F
df = spark.createDataFrame([(' 00000000995510.32',)], ['number'])
df.withColumn('trimmed', F.expr("TRIM(LEADING ' 0' FROM number)")).show()
------------------ ---------
| number| trimmed|
------------------ ---------
| 00000000995510.32|995510.32|
------------------ ---------
Or if you want an actual number, you could simply cast it to float (or decimal). Note any value which cannot be cast will become NULL
.
df.withColumn('trimmed', F.col('number').cast('float')).show()
------------------ --------
| number| trimmed|
------------------ --------
| 00000000995510.32|995510.3|
------------------ --------