Keep records with a specific prefix and filter all-numeric

Time:01-24

I have a PySpark DataFrame that looks like this:

serial_number
000001234
000002887
00008765
0745-218
01-7865
040/7868L
0000124
00002364
01231325246
068775H

I want to keep only the records that start with a single 0 (0, not 00) and that are not purely numeric, i.e. they must contain at least one alphabetic or special character. So I want to keep only:

serial_number
0745-218
01-7865
040/7868L
068775H

I tried regex patterns such as ^0[^0], but that also accepts all-numeric entries like 01231325246.
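To illustrate the problem outside Spark, here is a minimal sketch using Python's re module on the sample values above. The names samples, naive and naive_hits are only for illustration. It shows that ^0[^0] checks just the first two characters, so an all-digit value with a single leading 0 still passes.

```python
import re

# Sample values copied from the serial_number column above.
samples = [
    "000001234", "000002887", "00008765", "0745-218", "01-7865",
    "040/7868L", "0000124", "00002364", "01231325246", "068775H",
]

# The attempted pattern: a 0 followed by any character that is not 0.
naive = re.compile(r"^0[^0]")

# Only the first two characters are inspected, so nothing forces a
# non-digit to appear anywhere in the value.
naive_hits = [s for s in samples if naive.search(s)]
print(naive_hits)
# ['0745-218', '01-7865', '040/7868L', '01231325246', '068775H']
```

The unwanted 01231325246 is in the result because nothing in the pattern requires a non-digit character.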

CodePudding user response:

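Use rlike with two conditions: the value must contain at least one non-digit (\D), and it must start with a 0 that is not followed by another 0 (^0[^0]).

from pyspark.sql.functions import col

df.where(col('serial_number').rlike(r'\D') & col('serial_number').rlike(r'^0[^0]')).show()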
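The two conditions can probably also be folded into a single pattern with a negative lookahead. The sketch below assumes the column is named serial_number as in the question. SERIAL_PATTERN and filter_serials are hypothetical names introduced here, not part of PySpark. The pyspark import sits inside the function only so that the pattern itself can be checked with plain re without a Spark installation.

```python
import re

# One pattern for both requirements:
#   ^0      the value starts with 0
#   (?!0)   the next character is not another 0
#   .*\D    at least one non-digit appears somewhere after the leading 0
SERIAL_PATTERN = r"^0(?!0).*\D"


def filter_serials(df, column="serial_number"):
    """Hypothetical helper: keep rows whose `column` matches SERIAL_PATTERN."""
    # Imported lazily so the pattern can be tested without Spark installed.
    from pyspark.sql import functions as F

    # rlike looks for the pattern anywhere in the value, hence the ^ anchor.
    return df.where(F.col(column).rlike(SERIAL_PATTERN))


# Quick check of the pattern itself with Python's re module.
for value in ["0745-218", "068775H"]:
    assert re.search(SERIAL_PATTERN, value)
for value in ["000001234", "01231325246"]:
    assert not re.search(SERIAL_PATTERN, value)
```

With a DataFrame df like the one in the question, filter_serials(df).show() should list only the four mixed values. The lookahead syntax is the same in Python and in the Java regex engine that Spark uses, but it is worth verifying on real data.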

CodePudding user response:

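import re

serial_numbers = [
    "000001234",
    "000002887",
    "00008765",
    "0745-218",
    "01-7865",
    "040/7868L",
    "0000124",
    "00002364",
    "01231325246",
    "068775H"
]

# ^0      starts with 0
# (?!0)   the 0 is not followed by another 0
# .*\D    at least one non-digit character appears later in the value
pattern = r"^0(?!0).*\D"

# Keep only the values that match the pattern from the start of the string.
matching_numbers = [number for number in serial_numbers if re.match(pattern, number)]

print(matching_numbers)
# ['0745-218', '01-7865', '040/7868L', '068775H']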