Python: counting the number of combinations dataframe-CodePudding

what is an easy way to count the number of combinations in across two columns? Given this dataframe:

df =
id testA testB
1  3     NA
1  1     3
2  2     NA
2  NA    1
2  0     0
3  NA    NA
3  1     1

I would like to know the different combinations independently of the score. For example:

Both tests: 3
A but not B: 2
B but not A: 1

CodePudding user response：

For two columns, can query each condition separately:

a_exists = df["testA"].notna()
b_exists = df["testB"].notna()

# both
>>> (a_exists & b_exists).sum()
3

# A, but not B
>>> (a_exists & ~b_exists).sum()
2

# B, but not A
>>> (~a_exists & b_exists).sum()
1

But can be automated with some itertools:

from itertools import compress, product

cols = ["A", "B"]
for ma, mb in product([0, 1], repeat=2):
    if ma == mb == 1: continue
    ab_info =  "".join(compress(cols, (1 - ma, 1 - mb)))
    counts  = ((a_exists ^ ma) & (b_exists ^ mb)).sum()

    print(ab_info, counts)

get "selector"s over [0, 1] x 2
if both are 1s, i.e., neither doesn't exist, skip the selector
otherwise
- get the selected parties with compress
- see whether to negate or not through bitwise XOR and AND the results
- sum to get the total count

which prints

AB 3
A 2
B 1

CodePudding user response：

Apply a notna() call to both tests then call value_counts:

result = df[["testA", "testB"]].notna().value_counts()
result.index = result.index.map({
    (True, True): "Both A and B",
    (True, False): "A but not B",
    (False, True): "B but not A",
    (False, False): "Neither A nor B"
})

Result:

Both A and B       3
A but not B        2
Neither A nor B    1
B but not A        1

CodePudding user response：

Following fmarc's answer to How to replace all non-NaN entries of a dataframe with 1 and all NaN with 0, we can convert the dataframe to contain only 0s and 1s.

    df = df.notnull().astype('int')

Then I replace the 0s and 1s in the column 'testA' with 'not A' and 'A'. I repeat the similar thing for the column 'testB'.

    df['testA'].replace(1, 'A', inplace=True)
    df['testA'].replace(0, 'not A', inplace=True)
    df['testB'].replace(1, 'B', inplace=True)
    df['testB'].replace(0, 'not B', inplace=True)

I do this to simplify our next step, which is to add the two strings in 'testA' and 'testB' and get their value_counts:

    df['sum'] = df['testA']   '   '  newdf['testB']
    df['sum'].value_counts()

The last line of code should yield your desired result. Here's what I got: Input:


   id   testA   testB
0   1   3.0 NaN
1   1   1.0 3.0
2   2   2.0 NaN
3   2   NaN 1.0
4   2   0.0 0.0
5   3   NaN NaN
6   3   1.0 1.0

Output:

A   B            3
A   not B        2
not A   B        1
not A   not B    1
Name: sum, dtype: int64