I am using pyspark and I would like to validate name_df whether the full name is valid or invalid by looking for string of first and last name contains the word 'TEST'
.
I have the following as a sample:
names_df = pd.DataFrame({'firstname': ['ETHEL', 'DENISE', 'ALEX', 'LILIAN', 'BERT','TEST'],
'lastname': ['TEST', 'SANTESTIBAN', 'TEST ONLY', 'TESTER','POTESTAN','AA']})
I have the following code:
validated = (
names_df.withColumn('valid_fullname',
when(names_df.firstname.contains('TEST') |
names_df.lastname.contains('TEST'), 'Invalid')
.otherwise(lit('valid')))
)
What I expect is the following where SANTESTIBAN
and POTESTAN
will be flagged as valid last names
{
'firstname': ['ETHEL', 'DENISE', 'ALEX', 'LILIAN', 'BERT' , 'TEST'],
'lastname': ['TEST', 'SANTESTIBAN', 'TEST ONLY', 'TESTER','POTESTAN','AA'],
'valid_fullname':['invalid','valid','invalid','invalid','valid','invalid']
}
But when I run the codes, the validated df shows me that SANTESTIBAN and POTESTAN are flagged as invalid because I think it looks at the TEST substring within each as part of the invalid. I thought of using REGEX but the generated REGEX line does not make sense to me.
The following line is just playing around with text DETESTTHANI TEST ONLY
DETE[a-zA-Z]TTHANI TE[a-zA-Z]T O[a-zA-Z][a-zA-Z]Y
CodePudding user response:
You'll probably want something like \bTEST\b
as your regexp; \b
matches word boundaries such as spaces and the start or end of the string.
However, that won't match TESTER
(since TEST
is part of the word) – but on the other hand TESTER
could be a real surname, couldn't it?
CodePudding user response:
Consider this as a plan b, especially if the number of invalids is not that many that you can put them in a list.
I usually come across this similar situation. However, there are times I end up with is to check how many are valid names compared to invalid names. When I cannot find a pattern that will encompass all strings to use regex, I will go with finding distinct values of the column, make a list of them, and use a conditional isin().
invalids = ['test','test only','tester','test_only','testonly','retest','tested']
validated = (
names_df.withColumn('valid_fullname',
when(names_df.firstname.isin(invalids) |
names_df.lastname.isin(invalids), 'Invalid')
.otherwise(lit('valid')))
)
AGAIN, this is just a workaround for a tricky situation like yours.