Home > Back-end >  How to check valid number in pandas dataframe column?
How to check valid number in pandas dataframe column?

Time:02-10

I am working on a project for university in which I received a dataframe:

      import pandas as pd

      df = pd.DataFrame({'id': [0,1,2,3,4,5,6,7,8],
                         'number': ['00005485', '9999', '11111', '458426', '9999999999',
                                    '11111111', '800000', '99999', '1111']})

      print(df)

      id    number
      0    00005485
      1    9999
      2    11111
      3    458426
      4    9999999999
      5    11111111
      6    800000
      7    99999
      8    1111

I would like to generate a dataframe with only the lines that have a valid 'number'. For a 'number' to be considered valid, it cannot have more than four repeating digits that are the same.

I made a code using filter by 'number' column as follows:

      df_result = df[(df['number'] != '9999') & (df['number'] != '99999') &
                     (df['number'] != '999999') & (df['number'] != '9999999') &
                     (df['number'] != '99999999') & (df['number'] != '999999999') &
                     (df['number'] != '9999999999') & (df['number'] != '9999999999') &
                     (df['number'] != '1111') & (df['number'] != '11111') &
                     (df['number'] != '111111') & (df['number'] != '1111111') &
                     (df['number'] != '111111111') & (df['number'] != '11111111')]

       print(df_result)

       id   number
       0    00005485
       3    458426
       6    800000
      

The output is correct. However, I would like to discover another way to do this operation on the dataframe.

CodePudding user response:

You can use regular expression for this, in particular, \1, which matches the first group:

valid = df[~df['number'].astype(str).str.match(r'^(\d)\1{3,}$')]

So that basically says to select all rows that have the start of the string (^), a digit (\d), and then 3 more ({3,}) of that same digit (\1), and then the end of the string ($).

Output:

>>> valid
     number
0  00005485
3    458426
6    800000

If you want to select invalid numbers instead of valid ones, remove the tilde ~ in the condition:

invalid = df[df['number'].astype(str).str.match(r'^(\d)\1{3,}$')]

Output:

>>> invalid
       number
1        9999
2       11111
4  9999999999
5    11111111
7       99999
8        1111
  • Related