I have a PySpark dataframe that looks like this:
serial_number
000001234
000002887
00008765
0745-218
01-7865
040/7868L
0000124
00002364
01231325246
068775H
I want to keep only the records that start with a single 0 (not several leading zeros) and that are not purely numeric, i.e. they must contain at least one alphabetic or special character. So I want to keep only:
serial_number
0745-218
01-7865
040/7868L
068775H
I tried regular expressions such as ^0[^0], but that also accepts all-numeric entries such as 01231325246.
CodePudding user response:
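Use rlike with two conditions: the value contains at least one non-digit, and it starts with 0. Code below:
from pyspark.sql.functions import col
df.where(col('serial_number').rlike(r'\D') & col('serial_number').rlike('^0')).show()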
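One caveat: a filter that only checks for a leading 0 would also keep a value with several leading zeros and a non-digit, which the question wants excluded. The sketch below checks the stricter version in plain Python, without a Spark session. The helper name keep and the sample value "000-12A" are made up for illustration and are not in the original data; re.search is used because, like rlike, it matches anywhere in the string.

```python
import re


def keep(serial: str) -> bool:
    # First character is 0 and the second character is anything but 0.
    starts_with_single_zero = re.search(r"^0[^0]", serial) is not None
    # At least one character that is not a digit.
    has_non_digit = re.search(r"\D", serial) is not None
    return starts_with_single_zero and has_non_digit


# "000-12A" is a hypothetical value added to show the multi-zero case.
samples = ["000001234", "0745-218", "01231325246", "068775H", "000-12A"]
kept = [s for s in samples if keep(s)]
print(kept)
```

Assuming the same column name as in the question, the two patterns should carry over to PySpark as col('serial_number').rlike(r'^0[^0]') & col('serial_number').rlike(r'\D').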
CodePudding user response:
import re
serial_numbers = [
"000001234",
"000002887",
"00008765",
"0745-218",
"01-7865",
"040/7868L",
"0000124",
"00002364",
"01231325246",
"068775H"
]
# A leading 0 not followed by another 0, plus at least one non-digit anywhere in the string
pattern = r"^0(?!0)(?=.*\D)"
matching_numbers = [number for number in serial_numbers if re.match(pattern, number)]
print(matching_numbers)