Comparing two dataframes if one contains the other


I have two dataframes A and B. For example:

A:

Key | Col1 | Col2 
 A     aa     bb 
 B     bb     bc
 C     cc     bd

B:

Key | Col1 | Col2 
 A     a       b     
 B     ab      c  
 C     cc      b  
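
For reference, here is one way to construct these sample frames (a sketch; I'm assuming Key is a regular column rather than the index):

import pandas as pd

A = pd.DataFrame({'Key': ['A', 'B', 'C'],
                  'Col1': ['aa', 'bb', 'cc'],
                  'Col2': ['bb', 'bc', 'bd']})
B = pd.DataFrame({'Key': ['A', 'B', 'C'],
                  'Col1': ['a', 'ab', 'cc'],
                  'Col2': ['b', 'c', 'b']})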

Both have exactly one row per key (Key is the key column of each dataframe), and they can hold different values in the other columns.

I'm trying to analyze the differences between the two dataframes.

I want 3 comparisons (possibly 4):

  1. the number of notna cells in A, per column
  2. the number of records that are equal, divided by the number of cells with a notna value in A, per column
  3. the number of records in B that are contained in A string-wise (for instance, 'aaa' is in 'baaab'), per column
  4. some other custom comparison, i.e. some similarity measure

I've managed to achieve 1 and 2, using this:

# 1) number of notna cells in A, per column
counts = A.notna().sum()
# 2) share of cells that are equal in both frames, per column
scores = B.eq(A).sum().div(counts)
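
On the sample data above this should give something like the following (note that Key is compared as an ordinary column here, so it scores 1.0; drop it or set it as the index if that's unwanted):

print(scores)

Key     1.000000
Col1    0.333333
Col2    0.000000
dtype: float64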

So for 3, I want to get:

Col1    2
Col2    3

My problem is that I don't know how to achieve a custom comparison like my third and fourth options. If both were in the same dataframe, then I could probably do something like this:

df[col] = df.apply(lambda x: x[col + '_1'] in x[col + '_2'], axis=1)

for each column, and sum the result. But that seems complicated and messy. Any suggestions for a cleaner solution with the two dataframes kept apart?
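
Spelled out, that single-frame idea would look roughly like this (a sketch, assuming a merge on Key with '_1'/'_2' suffixes):

# merge on Key, then test per column whether the B value ('_2')
# is a substring of the A value ('_1'), and count the matches
merged = A.merge(B, on='Key', suffixes=('_1', '_2'))
res = {col: merged.apply(lambda x: x[col + '_2'] in x[col + '_1'], axis=1).sum()
       for col in ['Col1', 'Col2']}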

CodePudding user response:

For 3), you can drop down to NumPy and call np.vectorize to apply Python's built-in str.find to all the elements of an array in one shot. The function returns True if the substring is found (find is not -1) and False otherwise.

str.find(sub[, start[, end]])
Return the lowest index in the string where substring sub is found within the slice s[start:end]. Optional arguments start and end are interpreted as in slice notation. Return -1 if sub is not found.

import numpy as np
import pandas as pd

# are B's values substrings of A's values?
ma = np.vectorize(lambda x, y: x.find(y) != -1)(A, B)

# rebuild a frame, move Key out of the columns, and count matches per column
res3 = pd.DataFrame(ma, columns=B.columns).set_index("Key").sum()

Output:

print(res3)

Col1    2
Col2    3
dtype: int64

And regarding 4), you may want to use fuzz.ratio with a list comprehension to calculate a similarity measure, i.e. a 0-100 score based on the Levenshtein distance between the two strings of each pair/column (A/B).

#pip install fuzzywuzzy
from fuzzywuzzy import fuzz

# similarity ratio (0-100) per cell, for each non-Key column
levenshtein_distance = [[fuzz.ratio(x, y)
                         for x, y in zip(A[col], B[col])]
                        for col in A.columns if col != "Key"]

print(levenshtein_distance)
#[[67, 50, 100], [67, 67, 67]]
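
(Note that fuzzywuzzy has since been republished as thefuzz, with the same fuzz.ratio API.) If you want a single per-column score rather than the raw lists, you could average them, e.g. (a sketch that assumes the same non-Key columns as above):

similarity = pd.Series([sum(row) / len(row) for row in levenshtein_distance],
                       index=[col for col in A.columns if col != "Key"])

print(similarity)
#Col1    72.333333
#Col2    67.000000
#dtype: float64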

CodePudding user response:

For #3, because the brunt of the work is spent on string searching, vectorization is unlikely to eliminate any critical bottleneck. Something simple like this should get you what your question asks for:

cols = [label for label in B.columns if label != 'Key']

# for each row and non-Key column, test whether the B value is a substring
# of the A value, then count the matches per column
df3 = pd.DataFrame([[B.loc[r, c] in A.loc[r, c] for c in cols]
                    for r in B.index], columns=cols).sum()

Output:

Col1    2
Col2    3
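
The same loop generalizes to your 4) as well: factor the comparison out into a function argument. A sketch (compare is a hypothetical helper, and pandas is assumed to be imported as pd):

def compare(A, B, func, key='Key'):
    # apply func(a_value, b_value) element-wise and sum the result per column
    cols = [label for label in B.columns if label != key]
    return pd.DataFrame([[func(A.loc[r, c], B.loc[r, c]) for c in cols]
                         for r in B.index], columns=cols).sum()

res3 = compare(A, B, lambda a, b: b in a)        # comparison 3
res4 = compare(A, B, lambda a, b: a[0] == b[0])  # e.g. first characters match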