Extract numbers that does not follow a pattern python-CodePudding

I have a CSV file which contains many columns for payments.

There's one column column called control_number. This is unique identifier for every transaction and it follows a pattern, all control numbers must.

Starts with 991 and contains 12 digits. Now, I want to extract list of all control numbers that do not follow a pattern.

Sample data:

control_number
991100267615
991100267632
991100267636
992100267635
991100267672
991100267697
995100267696
991100267726
991100267743
991100267761
991100267771
990100267779
991100267768
991100267811
991100267813
991100267814

Desired output should be like this:

control_number
992100267635
995100267696
990100267779

Here's what I have so far:

import pandas as pd
import re
import numpy as np
df = pd.read_csv("transactions.csv")
df.head(5)
invalid_transaction = df['control_number']
re.findall(r"(?<!\d)\d{5}(?!\d)", invalid_transaction)

CodePudding user response：

Since all the values are already 12-digit numbers, you do not need a regex here, you can use Series.str.startswith:

df = df[~df["control_number"].astype(str).str.startswith("991")]

Here is a minified test:

import pandas as pd
df = pd.DataFrame({'control_number':[991100267615, 991100267632, 991100267636, 992100267635]})

Output:

>>> df[~df["control_number"].astype(str).str.startswith("991")]
   control_number
3    992100267635

Note the ~ sign means a negation, only those records are taken that do not start with 991.

With regex, you can also use Series.str.contains:

df = df[~df["control_number"].astype(str).str.contains(r"^991[0-9]{9}$")]

Or, if you want to use positive regex logic (extract what does not match the pattern):

df = df[df["control_number"].astype(str).str.contains(r"^(?!991[0-9]{9}$)")]

Here, the regex matches

^ - start of string
(?!991[0-9]{9}$) - a negative lookahead that fails the match if, immediately to the right of the current location, there is 991 and any nine more digits till end of string.

CodePudding user response：

Try:

print(df[~df["control_number"].astype(str).str.match(r"991")])

Prints:

    control_number
3     992100267635
6     995100267696
11    990100267779

Or more specific:

print(df[~df["control_number"].astype(str).str.match(r"991\d{9}$")])