I have two datasets: one from a MySQL DB (source) and one from a Snowflake DB (target). I picked only one column from each DB for row-level validation. Below is sample data.

Source
emp name
Name1
Name2
Name3
Name4
Name5
Name6
Name7
Name8
Name9
Name10
Target
emp name
Name1
NAME2
Name3
Name4
Name5
Name6
Name7
Name9
Name10
Name11
The expected output is:

src_emp_name tgt_emp_name Record Validation
Name1 Name1 Match
Name2 NAME2 Mismatch
Name3 Name3 Match
Name4 Name4 Match
Name5 Name5 Match
Name6 Name6 Match
Name7 Name7 Match
Name8 null Extra
Name9 Name9 Match
Name10 Name10 Match
null Name11 Missing
count matrix
Match data 8
Mismatch data 1
Missing data 1
Extra 1
I tried to combine the two datasets (src/tgt) with the merge/concat functions and used np.where to create a new column based on conditions, but I am not getting the expected output. Please suggest a better way to achieve this.
CodePudding user response:
If the records are ordered, assigning an index column and joining on it is an appropriate approach:
```python
import numpy as np
import pandas as pd

df_source = pd.DataFrame(['Name1',
                          'Name2',
                          'Name3',
                          'Name4',
                          'Name5',
                          'Name6',
                          'Name7',
                          'Name8',
                          'Name9',
                          'Name10',
                          ], columns=['emp_name'])
df_target = pd.DataFrame(['Name1',
                          'NAME2',
                          'Name3',
                          'Name4',
                          'Name5',
                          'Name6',
                          'Name7',
                          None,
                          'Name9',
                          'Name10',
                          'Name11',
                          ], columns=['emp_name'])

# outer join on the index keeps rows that exist on only one side
df = pd.merge(df_source, df_target, how='outer',
              left_index=True, right_index=True,
              suffixes=('_source', '_target'))

# np.select picks the choice of the first condition that matches
conditions = [
    df['emp_name_source'].isnull(),
    df['emp_name_target'].isnull(),
    df['emp_name_target'] != df['emp_name_source'],
    df['emp_name_target'] == df['emp_name_source'],
]
choices = ['Missing', 'Extra', 'Mismatch', 'Match']
df['record_valdation'] = np.select(conditions, choices)

res = df.groupby(['record_valdation'])['record_valdation'].count()
display(res)
```
The last line displays the result as below:

```
record_valdation
Extra       1
Match       8
Mismatch    1
Missing     1
```

I hope it helps, thank you :)
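As a small aside on `np.select`: when no condition matches a row, it silently falls back to `0`; the `default=` argument makes that fallback explicit. A minimal sketch (the column names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'src': ['A', 'B', None], 'tgt': ['A', 'x', 'C']})

# conditions are checked in order; default= covers every unmatched row,
# so no silent 0 can appear in the result column
conditions = [
    df['src'].isnull(),
    df['tgt'].isnull(),
    df['src'] == df['tgt'],
]
choices = ['Missing', 'Extra', 'Match']
df['validation'] = np.select(conditions, choices, default='Mismatch')
# df['validation'] -> ['Match', 'Mismatch', 'Missing']
```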
CodePudding user response:
> I have tried to combine two datasets (src/tgt) with `merge`/`concat` functions and used `np.where` for creating a new column based on conditions but not getting the expected output. Please suggest a better way to achieve this.
The code below uses `concat`, as you did, but as a better way to achieve what you are after I suggest using pandas' own `.apply()` method instead of `numpy`. This requires defining a function which takes a pandas DataFrame row and returns the new column's value.
Why another answer to the question?
The other answer creates the 'Record Validation' column in the merged DataFrame with the `numpy.select()` method, which builds a list of values for the column. That approach requires importing the `numpy` module and demands great care with the list of conditions and the list of choices: when no condition matches a row, `numpy.select()` silently emits its default value `0` instead of one of the choices, and such unexpected results can become very hard to debug. Another disadvantage of `numpy.select()` is that for very large datasets all conditions are evaluated for all rows up front (while building the `conditions` list), which is not the case when a function for `pandas.apply()` is used.

A further difference is that the code in the other answer does not handle empty strings in the data rows (the question's data contains an empty line), and it does not replace pandas NaN values and empty-string values with 'null', as the expected output in the question suggests.

The code below therefore uses, instead of a list of conditions and a list of value choices, a function which is applied to each row of the DataFrame to obtain the value for the 'Record Validation' column. In my eyes this gives more room for flexibility when some other evaluation is to be done in addition to creating the new column; the code demonstrates this flexibility by also building a dictionary with the record-validation counts, and it includes multi-line strings and comment lines with additional information.
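To make the evaluation argument concrete, here is a small made-up sketch that counts how often a comparison actually runs under both approaches (the `expensive_eq` helper, the column names, and the data are invented for illustration):

```python
import numpy as np
import pandas as pd

calls = {'n': 0}

def expensive_eq(a, b):
    # counts how often the comparison is actually evaluated
    calls['n'] += 1
    return a == b

df = pd.DataFrame({'src': ['A', 'B'], 'tgt': ['A', 'B']})

# np.select: every condition Series is built in full up front,
# even though the first condition already decides every row here
conds = [
    df.apply(lambda r: expensive_eq(r['src'], r['tgt']), axis=1),
    df.apply(lambda r: expensive_eq(r['src'].lower(), r['tgt'].lower()), axis=1),
]
np.select(conds, ['Match', 'CaseMatch'], default='Mismatch')
select_calls = calls['n']  # both conditions evaluated for both rows

calls['n'] = 0

def validate(row):
    if expensive_eq(row['src'], row['tgt']):
        return 'Match'       # returns early ...
    if expensive_eq(row['src'].lower(), row['tgt'].lower()):
        return 'CaseMatch'   # ... so this check never runs for matching rows
    return 'Mismatch'

df.apply(validate, axis=1)
apply_calls = calls['n']  # one call per row
```

With these two rows, `select_calls` is 4 while `apply_calls` is 2; the gap grows with the number of rows and conditions.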
The function `record_validation_and_count`:

```python
merged_df, count_matrix_df = record_validation_and_count(src_df, tgt_df)
```

accepts both input data types: multi-line strings and pandas DataFrames. It also fills a global dictionary `count_matrix_dct` for the count-matrix output, because the matrix obtained by the pandas grouping-and-counting code cannot list counts for values that are never found (in other words, there will be no entries with the value 0 in the pandas count-matrix DataFrame).
```python
import pandas as pd

""" I have two datasets, one from a MySQL DB (source) and another from
a Snowflake DB (target). I picked only one column from each DB for
row-level validation. Below is sample data. Source"""
# Assuming that the datasets come as multi-line strings:
src_df = """\
emp name
Name1
Name2
Name3
Name4
Name5
Name6
Name7
Name8
Name9
Name10"""
tgt_df = """\
emp name
Name1
NAME2
Name3
Name4
Name5
Name6
Name7
Name9
Name10
Name11"""
"""expected output is
src_emp_name tgt_emp_name Record Validation
Name1 Name1 Match
Name2 NAME2 Mismatch
Name3 Name3 Match
Name4 Name4 Match
Name5 Name5 Match
Name6 Name6 Match
Name7 Name7 Match
Name8 null Extra
Name9 Name9 Match
Name10 Name10 Match
null Name11 Missing
count matrix
Match data 8
mismatch data 1
missing data 1
extra 1"""
"""I tried to combine the two datasets (src/tgt) with merge/concat functions"""
# USE: merged_df = pd.concat([src_df_1, tgt_df_1], axis=1)
"""and used np.where to create a new column based on conditions, but I am not
getting the expected output. Please suggest a better way to achieve this."""

# A better approach is to use a function which takes a row of the
# DataFrame as parameter to create the new column's values.
# This allows including in the function modifications of the merged
# DataFrame, replacing pandas NaN and empty '' strings with 'null', along
# with other evaluation code (here: a dictionary for the count matrix):
count_matrix_dct = {
    'match data':    0,
    'mismatch data': 0,
    'missing data':  0,
    'extra':         0,
}

def record_validation(row, case_sensitive=True):
    # can be changed to False HERE --^
    global count_matrix_dct
    #  v-- first column of the merged DataFrame
    if row.iloc[0] == '' or pd.isnull(row.iloc[0]):
        count_matrix_dct['missing data'] += 1
        row.iloc[0] = 'null'
        return 'Missing'
    #    v-- second column of the merged DataFrame
    elif row.iloc[1] == '' or pd.isnull(row.iloc[1]):
        count_matrix_dct['extra'] += 1
        row.iloc[1] = 'null'
        return 'Extra'
    if case_sensitive:
        # .strip() removes leading/trailing spaces
        ismatch = row.iloc[0].strip() == row.iloc[1].strip()
    else:
        # .upper() assures a case-insensitive match
        ismatch = row.iloc[0].upper().strip() == row.iloc[1].upper().strip()
    if ismatch:
        count_matrix_dct['match data'] += 1
        return 'Match'
    else:
        count_matrix_dct['mismatch data'] += 1
        return 'Mismatch'
#:def

# ---
# Once the data are available as multi-line strings or pandas DataFrames
# you can use the following functions ... :
def record_validation_and_count(src_df, tgt_df):
    """ Returns a DataFrame with a Record Validation column and a
    DataFrame with the Record Validation counts """
    if isinstance(src_df, str):  # if data are provided as a string
        src_ls = src_df.split('\n')
        src_df = pd.DataFrame(src_ls[1:], columns=[src_ls[0]])
    if isinstance(tgt_df, str):
        tgt_ls = tgt_df.split('\n')
        tgt_df = pd.DataFrame(tgt_ls[1:], columns=[tgt_ls[0]])
    if not (isinstance(src_df, pd.DataFrame) and isinstance(tgt_df, pd.DataFrame)):
        raise ValueError("Valid data types are: 'str' and 'pd.DataFrame'")
    # str() is necessary in case of a numerical column name
    src_df.rename(columns={src_df.columns[0]: 'src_' + str(src_df.columns[0])}, inplace=True)
    tgt_df.rename(columns={tgt_df.columns[0]: 'tgt_' + str(tgt_df.columns[0])}, inplace=True)
    merged_df = pd.concat([src_df, tgt_df], axis=1)
    # axis=1 -> apply the function ( record_validation ) to each row
    # (axis=0 would apply the function to each column)
    # Note: replacing NaN/'' with 'null' by mutating `row` inside .apply()
    # relies on pandas passing a view of the data; with copy-on-write
    # (pandas >= 2.0) use merged_df.fillna('null') after .apply() instead.
    merged_df['Record Validation'] = merged_df.apply(record_validation, axis=1)
    count_matrix_df = merged_df.groupby(['Record Validation'])['Record Validation'].count()
    return merged_df, count_matrix_df
#:def

def print_results():
    print(f'{count_matrix_dct=}')
    print('====================')
    print("count matrix")
    for k, v in count_matrix_dct.items():
        print(f'{k:15} {v:5d}')
    print('====================')
    print(count_matrix_df)
    print('====================')
    print(merged_df)
#:def

# ... to create the merged DataFrame with the Record Validation column
merged_df, count_matrix_df = record_validation_and_count(src_df, tgt_df)
# and print the results
print_results()
```
The code above gives the following output:

```
count_matrix_dct={'match data': 8, 'mismatch data': 1, 'missing data': 1, 'extra': 1}
====================
count matrix
match data          8
mismatch data       1
missing data        1
extra               1
====================
Record Validation
Extra       1
Match       8
Mismatch    1
Missing     1
Name: Record Validation, dtype: int64
====================
   src_emp name tgt_emp name Record Validation
0         Name1        Name1             Match
1         Name2        NAME2          Mismatch
2         Name3        Name3             Match
3         Name4        Name4             Match
4         Name5        Name5             Match
5         Name6        Name6             Match
6         Name7        Name7             Match
7         Name8         null             Extra
8         Name9        Name9             Match
9        Name10       Name10             Match
10         null       Name11           Missing
```
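As noted above, a pandas groupby count cannot show categories that never occur. If you prefer to stay in pandas for the count matrix anyway, a sketch using `value_counts` plus `reindex` restores the zero counts (the 'Duplicate' category is made up here purely to demonstrate a 0 entry):

```python
import pandas as pd

# stand-in for the merged DataFrame built above
merged = pd.DataFrame(
    {'Record Validation': ['Match'] * 8 + ['Mismatch', 'Missing', 'Extra']})

categories = ['Match', 'Mismatch', 'Missing', 'Extra', 'Duplicate']

# value_counts drops absent categories; reindex restores them with 0
count_matrix = (merged['Record Validation']
                .value_counts()
                .reindex(categories, fill_value=0))
# count_matrix -> Match 8, Mismatch 1, Missing 1, Extra 1, Duplicate 0
```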