Home > database >  find the rows with specific pattern
find the rows with specific pattern

Time:02-01

I need to find entries in a data frame in which one of its columns has a specific pattern like 01-02-11-55-00115 on other words Number-number-number-number-number

raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'code': ['01-02-11-55-00115','01-02-11-55-00445','test', '01-0t-11-55-00115'],
'favorite_color': ['blue', 'blue', 'yellow', "green"],
'grade': [88, 92, 95, 70]}
 df = pd.DataFrame(raw_data)df.head()

retrieve the rows that have that condition , the condition will be in the 'code' column so the code has to retrieve the first and second rows

CodePudding user response:

Here is one way to do it (with str.isnumeric) :

out = df.loc[df["code"].replace("-", "", regex=True).str.isnumeric()]

​ Output :

print(out)
             name               code favorite_color  grade
0  Willard Morris  01-02-11-55-00115           blue     88
1     Al Jennings  01-02-11-55-00445           blue     92

Or, if you need to flag those rows, use this :

df["flag"] = df["code"].replace("-", "", regex=True).str.isnumeric()

print(df)
               name               code favorite_color  grade   flag
0    Willard Morris  01-02-11-55-00115           blue     88   True
1       Al Jennings  01-02-11-55-00445           blue     92   True
2      Omar Mullins               test         yellow     95  False
3  Spencer McDaniel  01-0t-11-55-00115          green     70  False

CodePudding user response:

Filter values if decimal or - with Series.str.contains and regex with ^ for start of string, decimals, - value and $ for end of string:

df1 = df[df["code"].str.contains(r'^[0-9\-] $')]
print (df1)
             name               code favorite_color  grade
0  Willard Morris  01-02-11-55-00115           blue     88
1     Al Jennings  01-02-11-55-00445           blue     92

Or for pattern Number-number-number-number-number use:

df2 = df[df["code"].str.contains(r'^\d \-\d \-\d \-\d \-\d $')]
print (df2)
             name               code favorite_color  grade
0  Willard Morris  01-02-11-55-00115           blue     88
1     Al Jennings  01-02-11-55-00445           blue     92

CodePudding user response:

Use str.fullmatch for boolean indexing with the (\d -){4}\d regex (4 times \d - where \d is one or more digits, followed by digits):

out = df[df['code'].str.fullmatch(r'(\d -){4}\d ')]

If you need to match the specific "2 digits -" four times then "5 digits":

out = df[df['code'].str.fullmatch(r'(\d\d-){4}\d{5}')]

Output:

             name               code favorite_color  grade
0  Willard Morris  01-02-11-55-00115           blue     88
1     Al Jennings  01-02-11-55-00445           blue     92
  • Related