Hi I am comparing a dataset that have multiple paired columns and I want to count how many pairs are not matched. The data might look like this:
df=pd.DataFrame({'ID':['a','b','c'],
'A_1':[3,4,5],
'B_1':[1,5,7],
'A_2':[3,3,5],
'B_2':[3,3,7]})
what I want is:
df1=pd.DataFrame({'ID':['a','b','c'],
'A_1':[3,4,5],
'B_1':[1,5,7],
'A_2':[3,3,5],
'B_2':[3,3,7],
'Count':[1,2,0]})
For each ID, I compare A_1 with A_2, B_1 with B_2, and count the total unmatched results. In my real dataset, I have 15 pairs and I certainly don't want to compare manually.
alist=[A_1,A_2,B_1,B_2]
def match_count(df):
Non_match_item_count=0
for i in range(0,len(alist),2):
if round(df[i],1)!=round(df[i 1],1):
Non_match_item_count =1
return Non_match_item_count
df['Count']=df.apply(match_count,axis=1)
It is not working...
CodePudding user response:
There is probably a fancy pandas function which will satisfy your needs, but here is the approach I would use. given df:
ID A_1 B_1 A_2 B_2
0 a 3 1 3 3
1 b 4 5 3 3
2 c 5 7 5 7
I would utilize the Collections.Counter to count occurrences of items in each row and determine the number having only 1 entry as follows: from Collections import Counter
def compute_mismatches(df):
rslt = []
for r in range(df.shape[0]):
rc = Counter(df.iloc[r].to_list()[1:])
k = 0
for ky, val in rc.items():
if val == 1:
k =1
rslt.append(k)
return rslt
Then using this function:
df['Counts'] = compute_mismatches(df)
Yields:
ID A_1 B_1 A_2 B_2 Counts
0 a 3 1 3 3 0
1 b 4 5 3 3 3
2 c 5 7 5 7 1
CodePudding user response:
I believe this should work:
df['Count'] = (df.set_index('ID')
.groupby(lambda x: x[0],axis=1)
.diff()
.abs()
.gt(0)
.sum(axis=1)
.reset_index(drop=True))
CodePudding user response:
IIUC, you can extract the letter, groupby it and count the occurrences of non unique values:
cols = df.drop(columns='ID').columns
df['Count'] = (df[cols]
.groupby(cols.str.extract('(^[^_] )', expand=False), axis=1)
.nunique(1).ne(1).sum(1)
)
Output:
ID A_1 B_1 A_2 B_2 Count
0 a 3 1 3 3 1
1 b 4 5 3 3 2
2 c 5 7 5 7 0