Select rows that have 'TEST' but not as part of the name-CodePudding

I am using pyspark and I would like to validate name_df whether the full name is valid or invalid by looking for string of first and last name contains the word 'TEST'.

I have the following as a sample:


names_df = pd.DataFrame({'firstname': ['ETHEL', 'DENISE', 'ALEX', 'LILIAN', 'BERT','TEST'], 
                          'lastname': ['TEST', 'SANTESTIBAN', 'TEST ONLY', 'TESTER','POTESTAN','AA']})

I have the following code:

validated = (
     names_df.withColumn('valid_fullname',
                          when(names_df.firstname.contains('TEST') |
                               names_df.lastname.contains('TEST'), 'Invalid')
                          .otherwise(lit('valid')))
)

What I expect is the following where SANTESTIBAN and POTESTAN will be flagged as valid last names

{
  'firstname': ['ETHEL', 'DENISE', 'ALEX', 'LILIAN', 'BERT' , 'TEST'], 
  'lastname': ['TEST', 'SANTESTIBAN', 'TEST ONLY', 'TESTER','POTESTAN','AA'],
  'valid_fullname':['invalid','valid','invalid','invalid','valid','invalid']
}

But when I run the codes, the validated df shows me that SANTESTIBAN and POTESTAN are flagged as invalid because I think it looks at the TEST substring within each as part of the invalid. I thought of using REGEX but the generated REGEX line does not make sense to me.

The following line is just playing around with text DETESTTHANI TEST ONLY

DETE[a-zA-Z]TTHANI TE[a-zA-Z]T O[a-zA-Z][a-zA-Z]Y

CodePudding user response：

You'll probably want something like \bTEST\b as your regexp; \b matches word boundaries such as spaces and the start or end of the string.

However, that won't match TESTER (since TEST is part of the word) – but on the other hand TESTER could be a real surname, couldn't it?

CodePudding user response：

Consider this as a plan b, especially if the number of invalids is not that many that you can put them in a list.

I usually come across this similar situation. However, there are times I end up with is to check how many are valid names compared to invalid names. When I cannot find a pattern that will encompass all strings to use regex, I will go with finding distinct values of the column, make a list of them, and use a conditional isin().

invalids = ['test','test only','tester','test_only','testonly','retest','tested']

validated = (
     names_df.withColumn('valid_fullname',
                          when(names_df.firstname.isin(invalids) |
                               names_df.lastname.isin(invalids), 'Invalid')
                          .otherwise(lit('valid')))
)

AGAIN, this is just a workaround for a tricky situation like yours.