I have two datasets: one from a MySQL DB (source) and one from a Snowflake DB (target). I picked only one column from each DB for row-level validation. Below is sample data.

Source
emp name
Name1
Name2
Name3
Name4
Name5
Name6
Name7
Name8
Name9
Name10
Target
emp name
Name1
NAME2
Name3
Name4
Name5
Name6
Name7
Name9
Name10
Name11
The expected output is:

src_emp_name tgt_emp_name Record Validation
Name1 Name1 Match
Name2 NAME2 Mismatch
Name3 Name3 Match
Name4 Name4 Match
Name5 Name5 Match
Name6 Name6 Match
Name7 Name7 Match
Name8 null Extra
Name9 Name9 Match
Name10 Name10 Match
null Name11 Missing
count matrix
Match data 8
Mismatch data 1
Missing data 1
Extra 1
I tried to combine the two datasets (src/tgt) with the merge/concat functions and used np.where to create a new column based on conditions, but I am not getting the expected output. Please suggest a better way to achieve this.
CodePudding user response:
If the records are ordered, assigning an index column and joining on it is an appropriate approach:
```python
import numpy as np
import pandas as pd

df_source = pd.DataFrame(['Name1',
                          'Name2',
                          'Name3',
                          'Name4',
                          'Name5',
                          'Name6',
                          'Name7',
                          'Name8',
                          'Name9',
                          'Name10',
                          ], columns=['emp_name'])
df_target = pd.DataFrame(['Name1',
                          'NAME2',
                          'Name3',
                          'Name4',
                          'Name5',
                          'Name6',
                          'Name7',
                          None,
                          'Name9',
                          'Name10',
                          'Name11',
                          ], columns=['emp_name'])

# outer join on the index keeps rows that exist on only one side
df = pd.merge(df_source, df_target, how='outer',
              left_index=True, right_index=True,
              suffixes=('_source', '_target'))

# np.select picks the choice of the first condition that matches
conditions = [
    df['emp_name_source'].isnull(),
    df['emp_name_target'].isnull(),
    df['emp_name_target'] != df['emp_name_source'],
    df['emp_name_target'] == df['emp_name_source'],
]
choices = ['Missing', 'Extra', 'Mismatch', 'Match']
df['record_valdation'] = np.select(conditions, choices)

res = df.groupby(['record_valdation'])['record_valdation'].count()
display(res)
```
The last line displays the result as below:

```
record_valdation
Extra       1
Match       8
Mismatch    1
Missing     1
```

I hope it helps, thank you :)
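As a small aside on `np.select`: when no condition matches a row, it silently falls back to `0`; the `default=` argument makes that fallback explicit. A minimal sketch (the column names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'src': ['A', 'B', None], 'tgt': ['A', 'x', 'C']})

# conditions are checked in order; default= covers every unmatched row,
# so no silent 0 can appear in the result column
conditions = [
    df['src'].isnull(),
    df['tgt'].isnull(),
    df['src'] == df['tgt'],
]
choices = ['Missing', 'Extra', 'Match']
df['validation'] = np.select(conditions, choices, default='Mismatch')
# df['validation'] -> ['Match', 'Mismatch', 'Missing']
```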
CodePudding user response:
> I have tried to combine two datasets (src/tgt) with `merge`/`concat` functions and used `np.where` for creating a new column based on conditions but not getting the expected output. Please suggest a better way to achieve this.
The code below uses `concat`, as you did, but as a better way to achieve what you are after I suggest using pandas' own `.apply()` method instead of `numpy`. This requires defining a function which takes a pandas DataFrame row and returns the new column's value.
Why another answer to the question?
The other answer creates the 'Record Validation' column in the merged DataFrame with the `numpy.select()` method, which builds a list of values for the column. That approach requires importing the `numpy` module and demands great care with the list of conditions and the list of choices: when no condition matches a row, `numpy.select()` silently emits its default value `0` instead of one of the choices, and such unexpected results can become very hard to debug. Another disadvantage of `numpy.select()` is that for very large datasets all conditions are evaluated for all rows up front (while building the `conditions` list), which is not the case when a function for `pandas.apply()` is used.

A further difference is that the code in the other answer does not handle empty strings in the data rows (the question's data contains an empty line), and it does not replace pandas NaN values and empty-string values with 'null', as the expected output in the question suggests.

The code below therefore uses, instead of a list of conditions and a list of value choices, a function which is applied to each row of the DataFrame to obtain the value for the 'Record Validation' column. In my eyes this gives more room for flexibility when some other evaluation is to be done in addition to creating the new column; the code demonstrates this flexibility by also building a dictionary with the record-validation counts, and it includes multi-line strings and comment lines with additional information.
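To make the evaluation argument concrete, here is a small made-up sketch that counts how often a comparison actually runs under both approaches (the `expensive_eq` helper, the column names, and the data are invented for illustration):

```python
import numpy as np
import pandas as pd

calls = {'n': 0}

def expensive_eq(a, b):
    # counts how often the comparison is actually evaluated
    calls['n'] += 1
    return a == b

df = pd.DataFrame({'src': ['A', 'B'], 'tgt': ['A', 'B']})

# np.select: every condition Series is built in full up front,
# even though the first condition already decides every row here
conds = [
    df.apply(lambda r: expensive_eq(r['src'], r['tgt']), axis=1),
    df.apply(lambda r: expensive_eq(r['src'].lower(), r['tgt'].lower()), axis=1),
]
np.select(conds, ['Match', 'CaseMatch'], default='Mismatch')
select_calls = calls['n']  # both conditions evaluated for both rows

calls['n'] = 0

def validate(row):
    if expensive_eq(row['src'], row['tgt']):
        return 'Match'       # returns early ...
    if expensive_eq(row['src'].lower(), row['tgt'].lower()):
        return 'CaseMatch'   # ... so this check never runs for matching rows
    return 'Mismatch'

df.apply(validate, axis=1)
apply_calls = calls['n']  # one call per row
```

With these two rows, `select_calls` is 4 while `apply_calls` is 2; the gap grows with the number of rows and conditions.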
The function `record_validation_and_count`:

```python
merged_df, count_matrix_df = record_validation_and_count(src_df, tgt_df)
```

accepts both input data types: multi-line strings and pandas DataFrames. It also fills a global dictionary `count_matrix_dct` for the count-matrix output, because the matrix obtained by the pandas grouping-and-counting code cannot list counts for values that are never found (in other words, there will be no entries with the value 0 in the pandas count-matrix DataFrame).
```python
import pandas as pd

""" I have two datasets, one from a MySQL DB (source) and another from
a Snowflake DB (target). I picked only one column from each DB for
row-level validation. Below is sample data. Source"""
# Assuming that the datasets come as multi-line strings:
src_df = """\
emp name
Name1
Name2
Name3
Name4
Name5
Name6
Name7
Name8
Name9
Name10"""
tgt_df = """\
emp name
Name1
NAME2
Name3
Name4
Name5
Name6
Name7
Name9
Name10
Name11"""
"""expected output is
src_emp_name tgt_emp_name Record Validation
Name1 Name1 Match
Name2 NAME2 Mismatch
Name3 Name3 Match
Name4 Name4 Match
Name5 Name5 Match
Name6 Name6 Match
Name7 Name7 Match
Name8 null Extra
Name9 Name9 Match
Name10 Name10 Match
null Name11 Missing
count matrix
Match data 8
mismatch data 1
missing data 1
extra 1"""
"""I tried to combine the two datasets (src/tgt) with merge/concat functions"""
# USE: merged_df = pd.concat([src_df_1, tgt_df_1], axis=1)
"""and used np.where to create a new column based on conditions, but I am not
getting the expected output. Please suggest a better way to achieve this."""

# A better approach is to use a function which takes a row of the
# DataFrame as parameter to create the new column's values.
# This allows including in the function modifications of the merged
# DataFrame, replacing pandas NaN and empty '' strings with 'null', along
# with other evaluation code (here: a dictionary for the count matrix):
count_matrix_dct = {
    'match data':    0,
    'mismatch data': 0,
    'missing data':  0,
    'extra':         0,
}

def record_validation(row, case_sensitive=True):
    # can be changed to False HERE --^
    global count_matrix_dct
    #  v-- first column of the merged DataFrame
    if row.iloc[0] == '' or pd.isnull(row.iloc[0]):
        count_matrix_dct['missing data'] += 1
        row.iloc[0] = 'null'
        return 'Missing'
    #    v-- second column of the merged DataFrame
    elif row.iloc[1] == '' or pd.isnull(row.iloc[1]):
        count_matrix_dct['extra'] += 1
        row.iloc[1] = 'null'
        return 'Extra'
    if case_sensitive:
        # .strip() removes leading/trailing spaces
        ismatch = row.iloc[0].strip() == row.iloc[1].strip()
    else:
        # .upper() assures a case-insensitive match
        ismatch = row.iloc[0].upper().strip() == row.iloc[1].upper().strip()
    if ismatch:
        count_matrix_dct['match data'] += 1
        return 'Match'
    else:
        count_matrix_dct['mismatch data'] += 1
        return 'Mismatch'
#:def

# ---
# Once the data are available as multi-line strings or pandas DataFrames
# you can use the following functions ... :
def record_validation_and_count(src_df, tgt_df):
    """ Returns a DataFrame with a Record Validation column and a
    DataFrame with the Record Validation counts """
    if isinstance(src_df, str):  # if data are provided as a string
        src_ls = src_df.split('\n')
        src_df = pd.DataFrame(src_ls[1:], columns=[src_ls[0]])
    if isinstance(tgt_df, str):
        tgt_ls = tgt_df.split('\n')
        tgt_df = pd.DataFrame(tgt_ls[1:], columns=[tgt_ls[0]])
    if not (isinstance(src_df, pd.DataFrame) and isinstance(tgt_df, pd.DataFrame)):
        raise ValueError("Valid data types are: 'str' and 'pd.DataFrame'")
    # str() is necessary in case of a numerical column name
    src_df.rename(columns={src_df.columns[0]: 'src_' + str(src_df.columns[0])}, inplace=True)
    tgt_df.rename(columns={tgt_df.columns[0]: 'tgt_' + str(tgt_df.columns[0])}, inplace=True)
    merged_df = pd.concat([src_df, tgt_df], axis=1)
    # axis=1 -> apply the function ( record_validation ) to each row
    # (axis=0 would apply the function to each column)
    # Note: replacing NaN/'' with 'null' by mutating `row` inside .apply()
    # relies on pandas passing a view of the data; with copy-on-write
    # (pandas >= 2.0) use merged_df.fillna('null') after .apply() instead.
    merged_df['Record Validation'] = merged_df.apply(record_validation, axis=1)
    count_matrix_df = merged_df.groupby(['Record Validation'])['Record Validation'].count()
    return merged_df, count_matrix_df
#:def

def print_results():
    print(f'{count_matrix_dct=}')
    print('====================')
    print("count matrix")
    for k, v in count_matrix_dct.items():
        print(f'{k:15} {v:5d}')
    print('====================')
    print(count_matrix_df)
    print('====================')
    print(merged_df)
#:def

# ... to create the merged DataFrame with the Record Validation column
merged_df, count_matrix_df = record_validation_and_count(src_df, tgt_df)
# and print the results
print_results()
```
The code above gives the following output:

```
count_matrix_dct={'match data': 8, 'mismatch data': 1, 'missing data': 1, 'extra': 1}
====================
count matrix
match data          8
mismatch data       1
missing data        1
extra               1
====================
Record Validation
Extra       1
Match       8
Mismatch    1
Missing     1
Name: Record Validation, dtype: int64
====================
   src_emp name tgt_emp name Record Validation
0         Name1        Name1             Match
1         Name2        NAME2          Mismatch
2         Name3        Name3             Match
3         Name4        Name4             Match
4         Name5        Name5             Match
5         Name6        Name6             Match
6         Name7        Name7             Match
7         Name8         null             Extra
8         Name9        Name9             Match
9        Name10       Name10             Match
10         null       Name11           Missing
```
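As noted above, a pandas groupby count cannot show categories that never occur. If you prefer to stay in pandas for the count matrix anyway, a sketch using `value_counts` plus `reindex` restores the zero counts (the 'Duplicate' category is made up here purely to demonstrate a 0 entry):

```python
import pandas as pd

# stand-in for the merged DataFrame built above
merged = pd.DataFrame(
    {'Record Validation': ['Match'] * 8 + ['Mismatch', 'Missing', 'Extra']})

categories = ['Match', 'Mismatch', 'Missing', 'Extra', 'Duplicate']

# value_counts drops absent categories; reindex restores them with 0
count_matrix = (merged['Record Validation']
                .value_counts()
                .reindex(categories, fill_value=0))
# count_matrix -> Match 8, Mismatch 1, Missing 1, Extra 1, Duplicate 0
```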