Suppose I have a column (in PySpark) with the values like this:
column_a
force is N 12.555.335 weight and second force is 12N 67.87.99
weight is 5.704.149 lbs and height is 57.89
5.635.622 lbs of weight
Acid is 2.395
In all of these rows, I want to remove decimals where there are 3 digits after the decimal. For example, 12.555.335 changes to 12555335, 2.395 changes to 2395. But, the number with only 2 digits after decimal remains the same. So, my final output would be:
column_a
force is N 12555335 weight and second force is 12N 67.87.99
weight is 5704149 lbs and height is 57.89
5635622 lbs of weight
Acid is 2395
CodePudding user response:
You're going to need a UDF. Therefore, I suggest you try with plain Python first.
For example,
import re
def replace_dots(s):
matches = re.finditer(regex, s, re.MULTILINE)
data = []
start = end = None
for matchNum, match in enumerate(matches):
start = match.start()
if end is not None:
# have data from previous iteration, so add up to the new match
data.append(s[end:start])
else:
# default case, we're starting from the start of the string
data.append(s[:start])
end = match.end()
# add a whole match without the periods
data.append(s[start:end].replace(".", ''))
# might have left-over data after the final match
data.append(s[end:])
return ''.join(data)
# decimal followed by 3 or more digits, followed by an optional decimal
regex = r"(.\d{3,}\.?)"
test_strs = [
"force is N 12.5.33 weight and second force is 12N 67.87.99",
"weight is 5.704.149 lbs and height is 57.89",
"5.635.622 lbs of weight",
"Acid is 2.395"
]
for s in test_strs:
print(replace_dots(s))
Output
force is N 12555335 weight and second force is 12N 67.87.99
weight is 5704149 lbs and height is 57.89
5635622 lbs of weight
Acid is 2395
Keep in mind, if you had sentences like This number is 12.345.
, Then that would remove the trailing period...
CodePudding user response:
Do just that. Use look ahead assertion to replace dot only if it is followed by 3 dibgits. Code below
df.withColumn('column_a_edited', regexp_replace('column_a', '\.(?=\d{3})','')).show(truncate=False)
--- ------------------------------------------------------------- -----------------------------------------------------------
|id |column_a |column_a_edited |
--- ------------------------------------------------------------- -----------------------------------------------------------
|1 |force is N 12.555.335 weight and second force is 12N 67.87.99|force is N 12555335 weight and second force is 12N 67.87.99|
|2 |weight is 5.704.149 lbs and height is 57.89 |weight is 5704149 lbs and height is 57.89 |
|3 |5.635.622 lbs of weight |5635622 lbs of weight |
|4 |Acid is 2.395 |Acid is 2395 |
--- ------------------------------------------------------------- -----------------------------------------------------------