I am working on a project for university in which I received a dataframe:
import pandas as pd
df = pd.DataFrame({'id': [0,1,2,3,4,5,6,7,8],
'number': ['00005485', '9999', '11111', '458426', '9999999999',
'11111111', '800000', '99999', '1111']})
print(df)
id number
0 00005485
1 9999
2 11111
3 458426
4 9999999999
5 11111111
6 800000
7 99999
8 1111
I would like to generate a dataframe with only the lines that have a valid 'number'. For a 'number' to be considered valid, it cannot have more than four repeating digits that are the same.
I made a code using filter by 'number' column as follows:
df_result = df[(df['number'] != '9999') & (df['number'] != '99999') &
(df['number'] != '999999') & (df['number'] != '9999999') &
(df['number'] != '99999999') & (df['number'] != '999999999') &
(df['number'] != '9999999999') & (df['number'] != '9999999999') &
(df['number'] != '1111') & (df['number'] != '11111') &
(df['number'] != '111111') & (df['number'] != '1111111') &
(df['number'] != '111111111') & (df['number'] != '11111111')]
print(df_result)
id number
0 00005485
3 458426
6 800000
The output is correct. However, I would like to discover another way to do this operation on the dataframe.
CodePudding user response:
You can use regular expression for this, in particular, \1
, which matches the first group:
valid = df[~df['number'].astype(str).str.match(r'^(\d)\1{3,}$')]
So that basically says to select all rows that have the start of the string (^
), a digit (\d
), and then 3 more ({3,}
) of that same digit (\1
), and then the end of the string ($
).
Output:
>>> valid
number
0 00005485
3 458426
6 800000
If you want to select invalid numbers instead of valid ones, remove the tilde ~
in the condition:
invalid = df[df['number'].astype(str).str.match(r'^(\d)\1{3,}$')]
Output:
>>> invalid
number
1 9999
2 11111
4 9999999999
5 11111111
7 99999
8 1111