I have a CSV file which contains many columns for payments.
There's one column column called control_number
. This is unique identifier for every transaction and it follows a pattern, all control numbers must.
Starts with 991
and contains 12 digits. Now, I want to extract list of all control numbers that do not follow a pattern.
Sample data:
control_number |
---|
991100267615 |
991100267632 |
991100267636 |
992100267635 |
991100267672 |
991100267697 |
995100267696 |
991100267726 |
991100267743 |
991100267761 |
991100267771 |
990100267779 |
991100267768 |
991100267811 |
991100267813 |
991100267814 |
Desired output should be like this:
control_number |
---|
992100267635 |
995100267696 |
990100267779 |
Here's what I have so far:
import pandas as pd
import re
import numpy as np
df = pd.read_csv("transactions.csv")
df.head(5)
invalid_transaction = df['control_number']
re.findall(r"(?<!\d)\d{5}(?!\d)", invalid_transaction)
CodePudding user response:
Since all the values are already 12-digit numbers, you do not need a regex here, you can use Series.str.startswith
:
df = df[~df["control_number"].astype(str).str.startswith("991")]
Here is a minified test:
import pandas as pd
df = pd.DataFrame({'control_number':[991100267615, 991100267632, 991100267636, 992100267635]})
Output:
>>> df[~df["control_number"].astype(str).str.startswith("991")]
control_number
3 992100267635
Note the ~
sign means a negation, only those records are taken that do not start with 991
.
With regex, you can also use Series.str.contains
:
df = df[~df["control_number"].astype(str).str.contains(r"^991[0-9]{9}$")]
Or, if you want to use positive regex logic (extract what does not match the pattern):
df = df[df["control_number"].astype(str).str.contains(r"^(?!991[0-9]{9}$)")]
Here, the regex matches
^
- start of string(?!991[0-9]{9}$)
- a negative lookahead that fails the match if, immediately to the right of the current location, there is991
and any nine more digits till end of string.
CodePudding user response:
Try:
print(df[~df["control_number"].astype(str).str.match(r"991")])
Prints:
control_number
3 992100267635
6 995100267696
11 990100267779
Or more specific:
print(df[~df["control_number"].astype(str).str.match(r"991\d{9}$")])