In Python, I am comparing dataframes containing strings to decide if it should pass or fail. How can-CodePudding

I have over 20 test cases that check a CSV for data anomalies due to data entry. This test case (#15) compares the salutation and addressee to marital status.

# Test case  15
# Compares MrtlStat to  PrimAddText and PrimSalText
df = data[data['MrtlStat'].str.contains("Widow|Divorced|Single")]
df = df[df['PrimAddText'].str.contains("AND|&", na=False)]
data_15 = df[df['PrimSalText'].str.contains("AND|&", na=False)]

# Adds row to list of failed data
ids = data_15.index.tolist()

# Keep track of data that failed test case 15 
for i in ids:
  data.at[i,'Test Case Failed'] =', 15'

If MrtlStat contains Widow, Divorced, or Single while PrimAddText or PrimSalTexts contains AND or &, it should fail the test. This test works only if BOTH PrimSalTexts and PrimAddText contain AND or &.

Table showing data that passes but should fail:

PrimAddText	PrimSalText	MrtlStat
Mrs. Judith Elfrank	Mr. & Mrs. Elfrank & Michael	Widowed
Mr. & Mrs.Karl Magnusen	Mr. Magnusen	Widowed

Table showing data that fails as expected:

PrimAddText	PrimSalText	MrtlStat
Mr. & Mrs. Elfrank	Mr. & Mrs. Elfrank & Michael	Widowed

How can I adjust the test to work if only one of the columns (PrimSalTexts or PrimAddText) contains AND or &?

CodePudding user response：

You should not filter your data sequentially but combine the condition into a single one (using & and |). A good way to do it is numpy.where:

import pandas as pd
import numpy as np

# construct data
data = pd.DataFrame({
    'PrimAddText': ['Mrs. Judith Elfrank', 'Mr. & Mrs.Karl Magnusen', 'Mr. & Mrs. Elfrank'],
    'PrimSalText': ['Mr. & Mrs. Elfrank & Michael', 'Mr. Magnusen', 'Mr. & Mrs. Elfrank & Michael'],
    'MrtlStat': ['Widowed', 'Widowed', 'Widowed']
})

# Case 15 - create condition
data['Status_case15'] = np.where((data['MrtlStat'].str.contains("Widow|Divorced|Single") 
                           & (data['PrimAddText'].str.contains("AND|&", na=False) 
                              | data['PrimSalText'].str.contains("AND|&", na=False))), False, True)
# filter failing rows
data.loc[data['Status_case15'] == False]

# sum correct rows
sum(data['Status_case15'])

CodePudding user response：

You have an AND condition b/w the second and third condition, you can separate these out and capturing the result from each condition. finally combine the two lists together

# Test case  15
# Compares MrtlStat to  PrimAddText and PrimSalText
df = data[data['MrtlStat'].str.contains("Widow|Divorced|Single")]
data_15_A = df[df['PrimAddText'].str.contains("AND|&", na=False)]
data_15_B = df[df['PrimSalText'].str.contains("AND|&", na=False)]

# Adds row to list of failed data
ids = data_15_A.index.tolist()   data_15_B.index.tolist()

# Keep track of data that failed test case 15 
for i in ids:
  data.at[i,'Test Case Failed'] =', 15'