Compare multiple columns and count total unmatched

Time:08-12

Hi, I am comparing a dataset that has multiple paired columns, and I want to count how many pairs do not match. The data might look like this:

import pandas as pd

df=pd.DataFrame({'ID':['a','b','c'],
                 'A_1':[3,4,5],
                 'B_1':[1,5,7],
                 'A_2':[3,3,5],
                 'B_2':[3,3,7]})

What I want is:

df1=pd.DataFrame({'ID':['a','b','c'],
                 'A_1':[3,4,5],
                 'B_1':[1,5,7],
                 'A_2':[3,3,5],
                 'B_2':[3,3,7],
                 'Count':[1,2,0]})

For each ID, I compare A_1 with A_2 and B_1 with B_2, and count the total number of mismatches. In my real dataset I have 15 pairs, so I certainly don't want to compare them manually.

alist=[A_1,A_2,B_1,B_2]

def match_count(df):
    Non_match_item_count=0
    for i in range(0,len(alist),2):
        if round(df[i],1)!=round(df[i+1],1):
            Non_match_item_count+=1
    return Non_match_item_count

df['Count']=df.apply(match_count,axis=1)

It is not working...
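For what it's worth, the loop itself is close to working: once the column names are quoted (so they can be looked up on each row) and used to index the row, it produces the desired counts. A runnable sketch of that idea:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'b', 'c'],
                   'A_1': [3, 4, 5],
                   'B_1': [1, 5, 7],
                   'A_2': [3, 3, 5],
                   'B_2': [3, 3, 7]})

# Column names as strings, stepped through pair by pair
alist = ['A_1', 'A_2', 'B_1', 'B_2']

def match_count(row):
    non_match = 0
    for i in range(0, len(alist), 2):
        # compare X_1 with X_2 for each pair
        if round(row[alist[i]], 1) != round(row[alist[i + 1]], 1):
            non_match += 1
    return non_match

df['Count'] = df.apply(match_count, axis=1)
print(df['Count'].tolist())  # [1, 2, 0]
```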

CodePudding user response:

There is probably a fancy pandas function that will satisfy your needs, but here is the approach I would use. Given df:

    ID  A_1 B_1 A_2 B_2
0   a   3   1   3   3
1   b   4   5   3   3
2   c   5   7   5   7  

I would use collections.Counter to count the occurrences of the items in each row and tally the values that appear only once:

from collections import Counter

def compute_mismatches(df):
    rslt = []
    for r in range(df.shape[0]):
        rc = Counter(df.iloc[r].to_list()[1:])
        k = 0
        for ky, val in rc.items():
            if val == 1:
                k += 1
        rslt.append(k)
    return rslt

Then using this function:

df['Counts'] = compute_mismatches(df)  

Yields:

    ID  A_1 B_1 A_2 B_2 Counts
0   a   3   1   3   3   1
1   b   4   5   3   3   2
2   c   5   7   5   7   0
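Note that counting singleton values is not exactly the same as counting mismatched pairs: a row where both pairs mismatch but the values repeat across pairs slips through. A quick check (using a compact, equivalent version of the function above and a hypothetical extra row):

```python
from collections import Counter
import pandas as pd

def compute_mismatches(df):
    rslt = []
    for r in range(df.shape[0]):
        rc = Counter(df.iloc[r].to_list()[1:])  # skip the ID column
        rslt.append(sum(1 for v in rc.values() if v == 1))
    return rslt

# Both pairs mismatch here (1 != 2 twice), but no value occurs exactly once,
# so the singleton count reports 0 instead of 2.
edge = pd.DataFrame({'ID': ['d'], 'A_1': [1], 'B_1': [1], 'A_2': [2], 'B_2': [2]})
print(compute_mismatches(edge))  # [0]
```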

CodePudding user response:

I believe this should work:

df['Count'] = (df.set_index('ID')
                 .groupby(lambda x: x[0], axis=1)
                 .diff()
                 .abs()
                 .gt(0)
                 .sum(axis=1)
                 .reset_index(drop=True))
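Newer pandas versions deprecate groupby's axis=1; the same grouped-diff idea can be expressed by transposing first (a sketch under that assumption, verified on the sample frame):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'b', 'c'],
                   'A_1': [3, 4, 5],
                   'B_1': [1, 5, 7],
                   'A_2': [3, 3, 5],
                   'B_2': [3, 3, 7]})

# Transpose so the pair columns become rows, group the rows by their leading
# letter, diff within each group, and count non-zero differences per ID.
counts = (df.set_index('ID').T
            .groupby(lambda x: x[0])
            .diff()
            .abs()
            .gt(0)
            .sum())

df['Count'] = counts.values
print(df['Count'].tolist())  # [1, 2, 0]
```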

CodePudding user response:

IIUC, you can extract the letter prefix, group the columns by it, and count the groups that contain more than one unique value:

cols = df.drop(columns='ID').columns

df['Count'] = (df[cols]
              .groupby(cols.str.extract('(^[^_]+)', expand=False), axis=1)
              .nunique(1).ne(1).sum(1)
              )

Output:

  ID  A_1  B_1  A_2  B_2  Count
0  a    3    1    3    3      1
1  b    4    5    3    3      2
2  c    5    7    5    7      0