Home > Net >  Extract numbers that does not follow a pattern python
Extract numbers that does not follow a pattern python

Time:07-17

I have a CSV file which contains many columns for payments.

There's one column column called control_number. This is unique identifier for every transaction and it follows a pattern, all control numbers must.

Starts with 991 and contains 12 digits. Now, I want to extract list of all control numbers that do not follow a pattern.

Sample data:

control_number
991100267615
991100267632
991100267636
992100267635
991100267672
991100267697
995100267696
991100267726
991100267743
991100267761
991100267771
990100267779
991100267768
991100267811
991100267813
991100267814

Desired output should be like this:

control_number
992100267635
995100267696
990100267779

Here's what I have so far:

import pandas as pd
import re
import numpy as np
df = pd.read_csv("transactions.csv")
df.head(5)
invalid_transaction = df['control_number']
re.findall(r"(?<!\d)\d{5}(?!\d)", invalid_transaction)

CodePudding user response:

Since all the values are already 12-digit numbers, you do not need a regex here, you can use Series.str.startswith:

df = df[~df["control_number"].astype(str).str.startswith("991")]

Here is a minified test:

import pandas as pd
df = pd.DataFrame({'control_number':[991100267615, 991100267632, 991100267636, 992100267635]})

Output:

>>> df[~df["control_number"].astype(str).str.startswith("991")]
   control_number
3    992100267635

Note the ~ sign means a negation, only those records are taken that do not start with 991.

With regex, you can also use Series.str.contains:

df = df[~df["control_number"].astype(str).str.contains(r"^991[0-9]{9}$")]

Or, if you want to use positive regex logic (extract what does not match the pattern):

df = df[df["control_number"].astype(str).str.contains(r"^(?!991[0-9]{9}$)")]

Here, the regex matches

  • ^ - start of string
  • (?!991[0-9]{9}$) - a negative lookahead that fails the match if, immediately to the right of the current location, there is 991 and any nine more digits till end of string.

CodePudding user response:

Try:

print(df[~df["control_number"].astype(str).str.match(r"991")])

Prints:

    control_number
3     992100267635
6     995100267696
11    990100267779

Or more specific:

print(df[~df["control_number"].astype(str).str.match(r"991\d{9}$")])
  • Related