Comparing two dataframes columns

Time:01-06

I have two dataframes that are structurally identical.

Both have the following format:

file_name | country_name | country_code | .....

I want to compare the two, and get a percentage of equality for each column.

The second dataframe is the test dataframe, which holds the true values. Some of its values are NaN, which should be ignored. So far I have managed to compare the two and get the total number of equal samples for each column; my problem is dividing each of those counts by the number of relevant samples (the ones that are not NaN in the second dataframe) in a "nice" way.

For example:

df1

file_name | country_name 
    1            a
    2            b
    3            d
    4            c

df2

file_name | country_name 
    1            a
    2            b
    3            nan
    4            d

I expect an output of 66% for this column, because 2 of the 3 relevant samples have the same value; row 3 is NaN in df2, so it is excluded from the calculation.

What I've done so far:

test_set = pd.read_excel(file_path)
test_set = test_set.astype(str)
a_set = pd.read_excel(file2_path)
a_set = a_set.astype(str)
merged_df = a_set.merge(test_set, on='file_name')
for field in fields:
    if field == 'file_name':
        continue
    merged_df[field] = merged_df.apply(
        lambda x: 0 if x[field + '_y'] == 'nan'
        else 1 if x[field + '_x'].lower() == x[field + '_y'].lower()
        else 0,
        axis=1)

scores = merged_df.drop('file_name', axis=1).sum(axis=0)

This gives me these (correct) results:

country_name      14
country_code       0
state_name         4
state_code        59
city              74
...

But now I want to divide each of them by the number of samples that are not NaN in the corresponding field of the test_set dataframe. I can think of naive ways to do this, like adding a row that holds the count of non-NaN values for each of these columns, but I'm looking for a cleaner solution.
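For what it's worth, the denominator can be read straight off test_set; a minimal sketch, assuming the frames were cast to strings with astype(str) as in the code above (so missing values are the literal string 'nan'):

```python
import pandas as pd
import numpy as np

# stand-in for the test_set loaded from Excel, cast to str as in the question
test_set = pd.DataFrame({'file_name': [1, 2, 3, 4],
                         'country_name': ['a', 'b', np.nan, 'd']}).astype(str)

# after astype(str), missing values become the literal string 'nan',
# so counting non-'nan' cells gives the per-column denominator
denominators = test_set.drop('file_name', axis=1).ne('nan').sum()
print(denominators['country_name'])  # 3 non-missing samples
```

Dividing the `scores` Series by `denominators` then aligns on column names automatically, with no extra column needed.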

CodePudding user response:

Since you have unique filenames, I would use fully vectorized operations and take advantage of index alignment:

# set the filename as index on both frames
df1b = df1.set_index('file_name')
df2b = df2.set_index('file_name')

# compare and divide by the number of non-NA
out = df1b.eq(df2b).sum().div(df2b.notna().sum())

Output:

country_name    0.666667
dtype: float64
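If the case-insensitive matching from the question is also needed, the same idea still works; a sketch using the sample frames from the question (lower-casing is my addition, not part of the answer above):

```python
import pandas as pd

df1 = pd.DataFrame({'file_name': [1, 2, 3, 4],
                    'country_name': ['a', 'B', 'd', 'c']})
df2 = pd.DataFrame({'file_name': [1, 2, 3, 4],
                    'country_name': ['A', 'b', None, 'd']})

df1b = df1.set_index('file_name')
df2b = df2.set_index('file_name')

# lower-case the string columns on both sides before comparing;
# NaN passes through str.lower() unchanged, so the non-NA count is unaffected
low1 = df1b.apply(lambda s: s.str.lower())
low2 = df2b.apply(lambda s: s.str.lower())

out = low1.eq(low2).sum().div(df2b.notna().sum())
print(out['country_name'])  # 2 matches out of 3 non-NaN rows -> 0.666...
```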

CodePudding user response:

If you don't have to merge you could use:

import pandas as pd
import numpy as np

df1 = pd.DataFrame([
    ["1", "a"],
    ["2", np.nan],
    ["3", "c"]
])

df2 = pd.DataFrame([
    ["1", "X"],
    ["100", "b"],
    ["3", "c"]
])

# expected:
# col 0: equal = 2, ratio: 2/3
# col 1: equal = 1, ratio: 1/2

df1 = df1.sort_index()
df2 = df2.sort_index()

def get_col_ratio(col):
    colA = df1[col]
    colB = df2[col]

    colA_ = colA[~(colA.isna() | colB.isna())]
    colB_ = colB[~(colA.isna() | colB.isna())]

    return (colA_.str.lower() == colB_.str.lower()).sum() / len(colA_)

ratios = pd.DataFrame([[get_col_ratio(i) for i in df1.columns]], columns=df1.columns)
print(ratios)

Or, using pd.merge:

fields = df1.columns
merged = pd.merge(df1,df2, left_index=True, right_index=True)

def get_ratio(col):
    cols = merged[[f"{col}_x",f"{col}_y"]]
    cols = cols.dropna()
    equal_rows = cols[cols.columns[0]].str.lower() == cols[cols.columns[1]].str.lower()
    return equal_rows.sum() / len(cols)

ratios = pd.DataFrame([[get_ratio(i) for i in fields]], columns=fields)
ratios
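Run against the sample frames defined above, the merge variant should reproduce the expected ratios from the comments (2/3 for column 0, 1/2 for column 1). A self-contained check:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame([["1", "a"], ["2", np.nan], ["3", "c"]])
df2 = pd.DataFrame([["1", "X"], ["100", "b"], ["3", "c"]])

# merging on the index suffixes the overlapping column labels to '0_x', '0_y', ...
merged = pd.merge(df1, df2, left_index=True, right_index=True)

def get_ratio(col):
    cols = merged[[f"{col}_x", f"{col}_y"]].dropna()
    equal_rows = cols.iloc[:, 0].str.lower() == cols.iloc[:, 1].str.lower()
    return equal_rows.sum() / len(cols)

ratios = [get_ratio(c) for c in df1.columns]
print(ratios)  # ratio 2/3 for column 0, 1/2 for column 1
```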