Home > Enterprise >  Count the number of different observations for every two columns and divide by the number of rows mu
Count the number of different observations for every two columns and divide by the number of rows mu

Time:08-26

I have a data frame with header and row names. Starting from the second column I would like to calculate the number of different observations (in that case 0 and 1 only) and divide them by the number of rows multiplied by two...

I am able to do this using COUNTIF from excel...but I was wondering if there is a way to do this in Python using pandas COUNTIF...

The data frame subset is as follows:

position    1X0812.0    1X0812.1    10192.0 10192.1 10316.0 10316.1 10349.0 10349.1 10418.0 10418.1 10482.0 10482.1
4035    0   0   0   0   0   0   0   1   0   0   0   0
4036    0   0   0   0   0   0   0   1   0   0   0   0
4083    0   0   0   0   1   0   0   1   0   0   0   0
4119    0   1   0   0   1   0   0   1   0   0   0   0
4164    0   1   0   0   1   0   0   1   0   0   0   0
4185    1   1   0   1   1   0   0   1   0   0   0   0
4379    1   1   0   1   0   0   0   1   0   0   0   0


The desired output would be something like this in a new file:

0 0.5714 0.8571 0.7142 0.5000 1.0000 1.0000
1 0.4285 0.1428 0.2857 0.5000 0.0000 0.0000

Looking over the internet what could only get the sum of one observation (0 in this case) for the whole dataset...

import pandas as pd
df = pd.read_csv('query.tsv', sep='\t')
(df == 0).sum()

result:

position         0
10046.0     309027
10046.1     308117
10192.0     308117
10192.1     309027
             ...  
9656.1      171617
9860.0      261170
9860.1      226217
9878.0      309027
9878.1      309027
Length: 1565, dtype: int64

thanks

CodePudding user response:

Counting the 1s among the 0s is equivalent to summing. So we can group the dataframe starting from the second column by the first part of the columns headers (before the .) and divide the sum by the count (alternatively you can directly use mean for the same).

g = df.iloc[:,1:].groupby(df.columns[1:].str.extract(r'(\d )\.', expand=False), axis=1)
s = g.sum().sum() / g.count().sum()  # or s = g.mean().mean()

s is an unnamed Series representing the share of 1s:

10046    0.428571
10192    0.142857
10316    0.285714
10349    0.500000
10418    0.000000
10482    0.000000
dtype: float64

From this Series we construct the desired resulting dataframe:

result = pd.concat([(1 - s).rename(0), s.rename(1)], axis=1).T

Result:

      10046     10192     10316  10349  10418  10482
0  0.571429  0.857143  0.714286    0.5    1.0    1.0
1  0.428571  0.142857  0.285714    0.5    0.0    0.0
  • Related