find the rows with specific pattern-CodePudding

I need to find entries in a data frame in which one of its columns has a specific pattern like 01-02-11-55-00115 on other words Number-number-number-number-number

raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'code': ['01-02-11-55-00115','01-02-11-55-00445','test', '01-0t-11-55-00115'],
'favorite_color': ['blue', 'blue', 'yellow', &quot;green&quot;],
'grade': [88, 92, 95, 70]}
 df = pd.DataFrame(raw_data)df.head()

retrieve the rows that have that condition , the condition will be in the 'code' column so the code has to retrieve the first and second rows

CodePudding user response：

Here is one way to do it (with str.isnumeric) :

out = df.loc[df["code"].replace("-", "", regex=True).str.isnumeric()]

Output :

print(out)
             name               code favorite_color  grade
0  Willard Morris  01-02-11-55-00115           blue     88
1     Al Jennings  01-02-11-55-00445           blue     92

Or, if you need to flag those rows, use this :

df["flag"] = df["code"].replace("-", "", regex=True).str.isnumeric()

print(df)
               name               code favorite_color  grade   flag
0    Willard Morris  01-02-11-55-00115           blue     88   True
1       Al Jennings  01-02-11-55-00445           blue     92   True
2      Omar Mullins               test         yellow     95  False
3  Spencer McDaniel  01-0t-11-55-00115          green     70  False

CodePudding user response：

Filter values if decimal or - with Series.str.contains and regex with ^ for start of string, decimals, - value and $ for end of string:

df1 = df[df["code"].str.contains(r'^[0-9\-] $')]
print (df1)
             name               code favorite_color  grade
0  Willard Morris  01-02-11-55-00115           blue     88
1     Al Jennings  01-02-11-55-00445           blue     92

Or for pattern Number-number-number-number-number use:

df2 = df[df["code"].str.contains(r'^\d \-\d \-\d \-\d \-\d $')]
print (df2)
             name               code favorite_color  grade
0  Willard Morris  01-02-11-55-00115           blue     88
1     Al Jennings  01-02-11-55-00445           blue     92

CodePudding user response：

Use str.fullmatch for boolean indexing with the (\d -){4}\d regex (4 times \d - where \d is one or more digits, followed by digits):

out = df[df['code'].str.fullmatch(r'(\d -){4}\d ')]

If you need to match the specific "2 digits -" four times then "5 digits":

out = df[df['code'].str.fullmatch(r'(\d\d-){4}\d{5}')]

Output:

             name               code favorite_color  grade
0  Willard Morris  01-02-11-55-00115           blue     88
1     Al Jennings  01-02-11-55-00445           blue     92