Pandas data validation with regex on one column

Time:05-01

I want to check whether each value in a column starts with a specific pattern: one letter, a dash, then a four-digit year followed by a letter, such as "A-2012A". After that, the rest of the value can be anything. I only need to confirm this first part and get back a true/false value. Is this possible?

Pattern: letter-yearletter

String validation on one column with regular expression.

example_column_1

DNA \ Assay
A-2000X-27
A-2000X-32
A-2000X-45
A-2000X-48
A-2000X-80
truth_value = df['DNA \\ Assay'].str.match(r'').astype(bool)

The line above is a sample call, with the regular expression r'' left empty.

My expected output with example_column_1 would be True

example_column_2

DNA \ Assay
Embryo FTA-Code-ID-2
Embryo FTA-Code-ID-3
Embryo FTA-Code-ID-4
Embryo FTA-Code-ID-5
Embryo FTA-Code-ID-6

My expected output with example_column_2 would be False

CodePudding user response:

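Use str.match with a regex. It matches from the start of each string, so whatever follows the prefix is ignored: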

df['valid'] = df['DNA \\ Assay'].str.match(r'[A-Z]-\d{4}[A-Z]', case=False)

output:

  DNA \ Assay  valid
0  A-2000X-27   True
1  A-2000X-32   True
2  A-2000X-45   True
3  A-2000X-48   True
4  A-2000X-80   True
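
To make the anchoring behaviour concrete, here is a minimal sketch comparing str.match with str.contains on a few values. The third value ("xx A-2000X-27") is a made-up test string, not from the question, and the variable names are only illustrative:

```python
import pandas as pd

# Same pattern as in the answer: letter, dash, four digits, letter
pattern = r"[A-Z]-\d{4}[A-Z]"

# First two values come from the question; the third is a made-up case
# where the pattern appears, but not at the start of the string.
s = pd.Series(["A-2000X-27", "Embryo FTA-Code-ID-2", "xx A-2000X-27"])

# str.match only succeeds if the pattern matches at the start of the value
starts_with = s.str.match(pattern, case=False)

# str.contains succeeds if the pattern matches anywhere in the value
anywhere = s.str.contains(pattern, case=False)

print(pd.DataFrame({"value": s, "match": starts_with, "contains": anywhere}))
```

Since the goal is to validate the beginning of each value, str.match is the better fit here. Note that \d{4} accepts any four digits, so it does not check that the number is a plausible year.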

If you want a single True/False for the whole column, add .all():

df['DNA \\ Assay'].str.match(r'[A-Z]-\d{4}[A-Z]', case=False).all()

output: True
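
Putting this together for both example columns from the question, a small helper could look like the sketch below. The function name column_is_valid is hypothetical, and passing na=False (so that missing values count as invalid) is an assumption about the desired behaviour rather than something stated in the question:

```python
import pandas as pd

PATTERN = r"[A-Z]-\d{4}[A-Z]"


def column_is_valid(series: pd.Series) -> bool:
    """Return True only if every value starts with letter-dash-4 digits-letter."""
    # na=False makes missing values count as invalid (an assumption here)
    return bool(series.str.match(PATTERN, case=False, na=False).all())


# Values copied from example_column_1 and example_column_2 in the question
col_1 = pd.Series(["A-2000X-27", "A-2000X-32", "A-2000X-45", "A-2000X-48", "A-2000X-80"])
col_2 = pd.Series([
    "Embryo FTA-Code-ID-2",
    "Embryo FTA-Code-ID-3",
    "Embryo FTA-Code-ID-4",
    "Embryo FTA-Code-ID-5",
    "Embryo FTA-Code-ID-6",
])

print(column_is_valid(col_1))  # True
print(column_is_valid(col_2))  # False
```

Wrapping the result in bool() returns a plain Python bool instead of a NumPy one, which is convenient if the value is later compared with `is True` or serialized.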
