I have over 20 test cases that check a CSV for data anomalies due to data entry. This test case (#15) compares the salutation and addressee to marital status.
# Test case 15
# Compares MrtlStat to PrimAddText and PrimSalText
df = data[data['MrtlStat'].str.contains("Widow|Divorced|Single")]
df = df[df['PrimAddText'].str.contains("AND|&", na=False)]
data_15 = df[df['PrimSalText'].str.contains("AND|&", na=False)]
# Adds row to list of failed data
ids = data_15.index.tolist()
# Keep track of data that failed test case 15
for i in ids:
data.at[i,'Test Case Failed'] =', 15'
If MrtlStat contains Widow, Divorced, or Single while PrimAddText or PrimSalTexts contains AND or &, it should fail the test. This test works only if BOTH PrimSalTexts and PrimAddText contain AND or &.
Table showing data that passes but should fail:
PrimAddText | PrimSalText | MrtlStat |
---|---|---|
Mrs. Judith Elfrank | Mr. & Mrs. Elfrank & Michael | Widowed |
Mr. & Mrs.Karl Magnusen | Mr. Magnusen | Widowed |
Table showing data that fails as expected:
PrimAddText | PrimSalText | MrtlStat |
---|---|---|
Mr. & Mrs. Elfrank | Mr. & Mrs. Elfrank & Michael | Widowed |
How can I adjust the test to work if only one of the columns (PrimSalTexts or PrimAddText) contains AND or &?
CodePudding user response:
You should not filter your data sequentially but combine the condition into a single one (using & and |). A good way to do it is numpy.where:
import pandas as pd
import numpy as np
# construct data
data = pd.DataFrame({
'PrimAddText': ['Mrs. Judith Elfrank', 'Mr. & Mrs.Karl Magnusen', 'Mr. & Mrs. Elfrank'],
'PrimSalText': ['Mr. & Mrs. Elfrank & Michael', 'Mr. Magnusen', 'Mr. & Mrs. Elfrank & Michael'],
'MrtlStat': ['Widowed', 'Widowed', 'Widowed']
})
# Case 15 - create condition
data['Status_case15'] = np.where((data['MrtlStat'].str.contains("Widow|Divorced|Single")
& (data['PrimAddText'].str.contains("AND|&", na=False)
| data['PrimSalText'].str.contains("AND|&", na=False))), False, True)
# filter failing rows
data.loc[data['Status_case15'] == False]
# sum correct rows
sum(data['Status_case15'])
CodePudding user response:
You have an AND condition b/w the second and third condition, you can separate these out and capturing the result from each condition. finally combine the two lists together
# Test case 15
# Compares MrtlStat to PrimAddText and PrimSalText
df = data[data['MrtlStat'].str.contains("Widow|Divorced|Single")]
data_15_A = df[df['PrimAddText'].str.contains("AND|&", na=False)]
data_15_B = df[df['PrimSalText'].str.contains("AND|&", na=False)]
# Adds row to list of failed data
ids = data_15_A.index.tolist() data_15_B.index.tolist()
# Keep track of data that failed test case 15
for i in ids:
data.at[i,'Test Case Failed'] =', 15'