How to check valid number in pandas dataframe column?-CodePudding

I am working on a project for university in which I received a dataframe:

      import pandas as pd

      df = pd.DataFrame({'id': [0,1,2,3,4,5,6,7,8],
                         'number': ['00005485', '9999', '11111', '458426', '9999999999',
                                    '11111111', '800000', '99999', '1111']})

      print(df)

      id    number
      0    00005485
      1    9999
      2    11111
      3    458426
      4    9999999999
      5    11111111
      6    800000
      7    99999
      8    1111

I would like to generate a dataframe with only the lines that have a valid 'number'. For a 'number' to be considered valid, it cannot have more than four repeating digits that are the same.

I made a code using filter by 'number' column as follows:

      df_result = df[(df['number'] != '9999') & (df['number'] != '99999') &
                     (df['number'] != '999999') & (df['number'] != '9999999') &
                     (df['number'] != '99999999') & (df['number'] != '999999999') &
                     (df['number'] != '9999999999') & (df['number'] != '9999999999') &
                     (df['number'] != '1111') & (df['number'] != '11111') &
                     (df['number'] != '111111') & (df['number'] != '1111111') &
                     (df['number'] != '111111111') & (df['number'] != '11111111')]

       print(df_result)

       id   number
       0    00005485
       3    458426
       6    800000

The output is correct. However, I would like to discover another way to do this operation on the dataframe.

CodePudding user response：

You can use regular expression for this, in particular, \1, which matches the first group:

valid = df[~df['number'].astype(str).str.match(r'^(\d)\1{3,}$')]

So that basically says to select all rows that have the start of the string (^), a digit (\d), and then 3 more ({3,}) of that same digit (\1), and then the end of the string ($).

Output:

>>> valid
     number
0  00005485
3    458426
6    800000

If you want to select invalid numbers instead of valid ones, remove the tilde ~ in the condition:

invalid = df[df['number'].astype(str).str.match(r'^(\d)\1{3,}$')]

Output:

>>> invalid
       number
1        9999
2       11111
4  9999999999
5    11111111
7       99999
8        1111