I have two data frames. I want to apply two separate functions that run validation checks on each data frame independently, and then concatenate any differences that arise into one transformed data frame.
The issue I am facing is that the first validation check should only run on rows where numeric values exist in ALL of the numeric columns of whichever data frame it is analyzing. If a row has ANY NaN values in those columns, the first validation check should skip that row.
The second validation check does not need that specification.
Here are the data frames, functions, and transformations:
import pandas as pd
import numpy as np
df1 = {'Fruits': ["Banana","Blueberry","Apple","Cherry","Mango","Pineapple","Watermelon","Papaya","Pear","Coconut"],
       'Price': [2,1.5,np.nan,2.5,3,4,np.nan,3.5,1.5,2],
       'Amount': [40,19,np.nan,np.nan,60,70,80,np.nan,45,102],
       'Quantity Frozen': [3,4,np.nan,15,np.nan,9,12,8,np.nan,80],
       'Quantity Fresh': [37,12,np.nan,45,np.nan,61,np.nan,24,14,20],
       'Multiple': [74,17,np.nan,112.5,np.nan,244,np.nan,84,21,40]}
df1 = pd.DataFrame(df1, columns=['Fruits','Price','Amount','Quantity Frozen','Quantity Fresh','Multiple'])
df2 = {'Fruits': ["Banana","Blueberry","Apple","Cherry","Mango","Pineapple","Watermelon","Papaya","Pear","Coconut"],
       'Price': [2,1.5,np.nan,2.6,3,4,np.nan,3.5,1.5,2],
       'Amount': [40,16,np.nan,np.nan,60,72,80,np.nan,45,100],
       'Quantity Frozen': [3,4,np.nan,np.nan,np.nan,9,12,8,np.nan,80],
       'Quantity Fresh': [np.nan,12,np.nan,45,np.nan,61,np.nan,24,15,20],
       'Multiple': [74,17,np.nan,112.5,np.nan,244,np.nan,84,20,40]}
df2 = pd.DataFrame(df2, columns=['Fruits','Price','Amount','Quantity Frozen','Quantity Fresh','Multiple'])
#Validation Check 1:
for name, dataset in {'Fruit Dataset1':df1,'Fruit Dataset2':df2}.items():
    dataset['dif_Stock on Hand'] = dataset['Amount'] - (dataset['Quantity Fresh'] + dataset['Quantity Frozen'])
    for varname, var in {'Stock on Hand vs. Quantity Fresh + Quantity Frozen': 'dif_Stock on Hand'}.items():
        print('{} differences in {}:'.format(name, varname))
        print(dataset[var].value_counts())
        print('\n')
#Validation Check 2:
for name, dataset in {'Fruit Dataset1':df1,'Fruit Dataset2':df2}.items():
    dataset['dif_Multiple'] = dataset['Price'] * dataset['Quantity Fresh'] - dataset['Multiple']
    for varname, var in {'Multiple vs. Price x Quantity Fresh': 'dif_Multiple'}.items():
        print('{} differences in {}:'.format(name, varname))
        print(dataset[var].value_counts())
        print('\n')
# #Wrangling internal inconsistency data frames to be in correct format
inconsistency_vars = ['dif_Stock on Hand','dif_Multiple']
inconsistency_var_betternames = {'dif_Stock on Hand':'Stock on Hand = Quantity Fresh + Quantity Frozen','dif_Multiple':'Multiple = Price x Quantity on Hand'}
# #Rollup1
idvars1=['Fruits']
df1 = df1[idvars1 + inconsistency_vars]
df2 = df2[idvars1 + inconsistency_vars]
df1 = df1.melt(id_vars = idvars1, value_vars = inconsistency_vars, value_name = 'Difference Magnitude')
df2 = df2.melt(id_vars = idvars1, value_vars = inconsistency_vars, value_name = 'Difference Magnitude')
df1['dataset'] = 'Fruit Dataset1'
df2['dataset'] = 'Fruit Dataset2'
# #First table in Internal Inconsistencies Sheet (Table 5)
inconsistent = pd.concat([df1,df2])
inconsistent = inconsistent[['variable','Difference Magnitude','dataset','Fruits']]
inconsistent['variable'] = inconsistent['variable'].map(inconsistency_var_betternames)
inconsistent = inconsistent[inconsistent['Difference Magnitude'] != 0]
Here is the desired output; for the first validation check, any row in either data frame with a NaN in any numeric column (every column except 'Fruits') is skipped:
#Desired output
inconsistent_true = {'variable': ["Stock on Hand = Quantity Fresh + Quantity Frozen","Stock on Hand = Quantity Fresh + Quantity Frozen","Multiple = Price x Quantity on Hand",
"Multiple = Price x Quantity on Hand","Multiple = Price x Quantity on Hand"],
'Difference Magnitude': [1,2,1,4.5,2.5],
'dataset':["Fruit Dataset1","Fruit Dataset1","Fruit Dataset2","Fruit Dataset2","Fruit Dataset2"],
'Fruits':["Blueberry","Coconut","Blueberry","Cherry","Pear"]}
inconsistent_true = pd.DataFrame(inconsistent_true, columns = ['variable', 'Difference Magnitude','dataset','Fruits'])
CodePudding user response:
A pandas function that may come in handy is pd.isnull(), which returns True for np.nan values.
For example, take df1:
pd.isnull(df1['Amount'][2])
True
This can be added as a check across all of your numeric columns, and then you use only the rows whose 'numeric_check' column equals 1:
df1['numeric_check'] = df1.apply(
    lambda x: 0 if (pd.isnull(x['Amount']) or pd.isnull(x['Price']) or
                    pd.isnull(x['Quantity Frozen']) or pd.isnull(x['Quantity Fresh']) or
                    pd.isnull(x['Multiple'])) else 1,
    axis=1)
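For example, the flag can then be used to restrict the first validation check to complete rows only. This is just a sketch (df1_complete is an illustrative name) and it assumes df1 still has its original columns, i.e. this happens before the melt/concat steps in the question:
# keep only rows where every numeric column is populated, then run check 1 on them
df1_complete = df1[df1['numeric_check'] == 1].copy()
print(df1_complete['Fruits'].tolist())
# ['Banana', 'Blueberry', 'Pineapple', 'Coconut']
The second validation check can keep using the unfiltered df1.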
CodePudding user response:
I hope I got your intention.
# make boolean mask, True if all numeric values are not NaN
mask = df1.select_dtypes('number').notna().all(axis=1)
print(df1[mask])
Fruits Price Amount Quantity Frozen Quantity Fresh Multiple
0 Banana 2.0 40.0 3.0 37.0 74.0
1 Blueberry 1.5 19.0 4.0 12.0 17.0
5 Pineapple 4.0 70.0 9.0 61.0 244.0
9 Coconut 2.0 102.0 80.0 20.0 40.0
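To wire this into the first validation check, one possible sketch is shown below. It assumes df1 and df2 are still in their original, pre-melt form, and it assumes 'Amount' is the Stock on Hand figure being validated (the variable names mask and complete are just illustrative):
# Validation Check 1, restricted to rows that are complete in every numeric column
for name, dataset in {'Fruit Dataset1': df1, 'Fruit Dataset2': df2}.items():
    # True only where no numeric column is NaN
    mask = dataset.select_dtypes('number').notna().all(axis=1)
    complete = dataset[mask].copy()
    # assumes 'Amount' is the Stock on Hand column being checked
    complete['dif_Stock on Hand'] = complete['Amount'] - (complete['Quantity Fresh'] + complete['Quantity Frozen'])
    print('{} differences in Stock on Hand vs. Quantity Fresh + Quantity Frozen:'.format(name))
    print(complete['dif_Stock on Hand'].value_counts())
    print('\n')
Validation Check 2 can stay exactly as written in the question, since it does not need the NaN restriction.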