I have two columns: one contains a string of digits and the other contains two or three digits, as below:
Account number
0 5493455646944
1 56998884221
2 95853255555926
3 5055555555495718323
4 56999998247361
5 6506569568
I would like to create a regex function which displays a flag if the account number contains 5 or more consecutive repeated digits.
So in theory, the target state is as follows:
Account number test
0 5493455646944 No
1 56998884221 No
2 95853255555926 Yes
3 5055555555495718323 Yes
4 56999998247361 Yes
5 6506569568 No
I was thinking something like:
def reg_finder(x):
    return re.findall('^([0-9])\1{5,}$', x)
I am not good with regex at all so unsure...thanks
Edit: this is what I tried:
def reg_finder(x):
    return re.findall('\b(\d)\1 \b', x)
example_df['test'] = example_df['Account number'].apply(reg_finder)
Account number test
0 5493455646944 []
1 56998884221 []
2 95853255555926 []
3 5055555555495718323 []
4 56999998247361 []
5 6506569568 []
Answer:
You can use
import pandas as pd
import warnings
warnings.filterwarnings("ignore", message="This pattern has match groups")
df = pd.DataFrame({'Account number':["5493455646944","56998884221","95853255555926","5055555555495718323","56999998247361","6506569568"]})
df['test'] = "No"
df.loc[df["Account number"].str.contains(r'([0-9])\1{4,}'), 'test'] = "Yes"
Output:
>>> df
Account number test
0 5493455646944 No
1 56998884221 No
2 95853255555926 Yes
3 5055555555495718323 Yes
4 56999998247361 Yes
5 6506569568 No
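If you would rather not silence the match-group warning, a plain `re` version avoids `str.contains` altogether. This is only a minimal sketch: `REPEAT_RE`, `flag_repeats` and `accounts` are names made up for illustration, and the list simply repeats the sample values from the question.

```python
import re

# One captured digit followed by at least four more copies of it,
# i.e. a run of five or more identical digits anywhere in the string.
REPEAT_RE = re.compile(r"([0-9])\1{4,}")


def flag_repeats(account: str) -> str:
    """Return "Yes" if the account number has 5+ identical digits in a row."""
    return "Yes" if REPEAT_RE.search(account) else "No"


# Sample values copied from the question.
accounts = [
    "5493455646944",
    "56998884221",
    "95853255555926",
    "5055555555495718323",
    "56999998247361",
    "6506569568",
]

for account in accounts:
    print(account, flag_repeats(account))
```

With a DataFrame this should plug straight into the call from the question, for example `example_df['test'] = example_df['Account number'].apply(flag_repeats)`.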
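Note that the pattern r'([0-9])\1{4,}' is written as a raw string literal, so the backslash reaches the regex engine as a literal backslash instead of being consumed as a string escape sequence. ([0-9]) captures one digit and \1{4,} requires that same digit at least four more times, which gives five or more in a row.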
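To see why the raw string matters, here is a small sketch comparing the same pattern with and without the `r` prefix. The variable names are only illustrative. It also hints at why the attempts in the question returned empty lists: in a normal string literal `'\1'` is the control character `chr(1)` and `'\b'` is a backspace, so the regex engine never sees a backreference or a word boundary.

```python
import re

account = "95853255555926"

non_raw = '([0-9])\1{4,}'   # '\1' is turned into chr(1) before re sees it
raw = r'([0-9])\1{4,}'      # '\1' stays a backreference to group 1

# The two patterns are different strings: the escape collapsed to one char.
print(len(non_raw), len(raw))

# Without the r prefix the pattern looks for a digit followed by chr(1) chars.
non_raw_match = re.search(non_raw, account)
print(non_raw_match)

# With the r prefix the backreference works and the run of fives is found.
raw_match = re.search(raw, account)
print(raw_match.group(0))
```

Separately from the escaping issue, the first attempt in the question anchors the pattern with `^` and `$` and uses `{5,}`, so it would only match a string made up entirely of one digit repeated six or more times.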