Home > other >  How do i remove all the dots from a number in a string using regex in a pyspark dataframe?
How do i remove all the dots from a number in a string using regex in a pyspark dataframe?

Time:09-13

Suppose I have a column with the values like this:

column_a
force is N 12.555.335 weight and second force is 12N 67.87.99
weight is 5.704.149 lbs and height is 57.89
5.635.622 lbs of weight 
Acid is 2.395

In all of these rows, I want to remove decimals where there are 3 digits after the decimal. For example, 12.555.335 changes to 12555335, 2.395 changes to 2395. But, the number with only 2 digits after decimal remains the same. So, my final output would be:

column_a
force is N 12555335 weight and second force is 12N 67.87.99
weight is 5704149 lbs and height is 57.89
5635622 lbs of weight 
Acid is 2395

CodePudding user response:

You're going to need a UDF. Therefore, I suggest you try with plain Python first.

For example,

import re

def replace_dots(s):
    matches = re.finditer(regex, s, re.MULTILINE)
    
    data = []
    start = end = None
    for matchNum, match in enumerate(matches):
        start = match.start()
        if end is not None:
            # have data from previous iteration, so add up to the new match
            data.append(s[end:start])
        else:
            # default case, we're starting from the start of the string
            data.append(s[:start])
        end = match.end()
        # add a whole match without the periods
        data.append(s[start:end].replace(".", ''))
        
    # might have left-over data after the final match
    data.append(s[end:])
    
    return ''.join(data)

# decimal followed by 3 or more digits, followed by an optional decimal
regex = r"(.\d{3,}\.?)"
test_strs = [
    "force is N 12.5.33 weight and second force is 12N 67.87.99",
    "weight is 5.704.149 lbs and height is 57.89",
    "5.635.622 lbs of weight",
    "Acid is 2.395"
]

for s in test_strs:
    print(replace_dots(s))

Output

force is N 12555335 weight and second force is 12N 67.87.99
weight is 5704149 lbs and height is 57.89
5635622 lbs of weight
Acid is 2395

Keep in mind, if you had sentences like This number is 12.345., Then that would remove the trailing period...

CodePudding user response:

Do just that. Use look ahead assertion to replace dot only if it is followed by 3 dibgits. Code below

df.withColumn('column_a_edited', regexp_replace('column_a', '\.(?=\d{3})','')).show(truncate=False)


 --- ------------------------------------------------------------- ----------------------------------------------------------- 
|id |column_a                                                     |column_a_edited                                            |
 --- ------------------------------------------------------------- ----------------------------------------------------------- 
|1  |force is N 12.555.335 weight and second force is 12N 67.87.99|force is N 12555335 weight and second force is 12N 67.87.99|
|2  |weight is 5.704.149 lbs and height is 57.89                  |weight is 5704149 lbs and height is 57.89                  |
|3  |5.635.622 lbs of weight                                      |5635622 lbs of weight                                      |
|4  |Acid is 2.395                                                |Acid is 2395                                               |
 --- ------------------------------------------------------------- ----------------------------------------------------------- 
  • Related