I have a data frame with a header and row names. Starting from the second column, I would like to count the occurrences of each distinct observation (only 0 and 1 in this case) and divide them by the number of rows multiplied by two...
I can do this in Excel using COUNTIF... but I was wondering whether there is a way to do the equivalent of COUNTIF in Python using pandas...
The data frame subset is as follows:
position 10046.0 10046.1 10192.0 10192.1 10316.0 10316.1 10349.0 10349.1 10418.0 10418.1 10482.0 10482.1
4035 0 0 0 0 0 0 0 1 0 0 0 0
4036 0 0 0 0 0 0 0 1 0 0 0 0
4083 0 0 0 0 1 0 0 1 0 0 0 0
4119 0 1 0 0 1 0 0 1 0 0 0 0
4164 0 1 0 0 1 0 0 1 0 0 0 0
4185 1 1 0 1 1 0 0 1 0 0 0 0
4379 1 1 0 1 0 0 0 1 0 0 0 0
The desired output would be something like this in a new file:
0 0.5714 0.8571 0.7142 0.5000 1.0000 1.0000
1 0.4285 0.1428 0.2857 0.5000 0.0000 0.0000
Looking over the internet, I could only find how to get the count of a single observation (0 in this case) for the whole dataset...
import pandas as pd
df = pd.read_csv('query.tsv', sep='\t')
(df == 0).sum()
result:
position 0
10046.0 309027
10046.1 308117
10192.0 308117
10192.1 309027
...
9656.1 171617
9860.0 261170
9860.1 226217
9878.0 309027
9878.1 309027
Length: 1565, dtype: int64
thanks
CodePudding user response:
Counting the 1s among the 0s is equivalent to summing. So we can group the dataframe, starting from the second column, by the first part of the column headers (before the .) and divide the sum by the count (alternatively, you can use mean directly for the same result).
g = df.iloc[:, 1:].groupby(df.columns[1:].str.extract(r'(\d+)\.', expand=False), axis=1)  # group the column pairs by the sample ID before the dot
s = g.sum().sum() / g.count().sum()  # number of 1s divided by total cells per group; or s = g.mean().mean()
s is an unnamed Series representing the share of 1s:
10046 0.428571
10192 0.142857
10316 0.285714
10349 0.500000
10418 0.000000
10482 0.000000
dtype: float64
From this Series we construct the desired resulting dataframe:
result = pd.concat([(1 - s).rename(0), s.rename(1)], axis=1).T
Result:
10046 10192 10316 10349 10418 10482
0 0.571429 0.857143 0.714286 0.5 1.0 1.0
1 0.428571 0.142857 0.285714 0.5 0.0 0.0
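Since the question asks for the output in a new file, the result can be written out with to_csv. A minimal sketch, assuming a tab-separated output file named result.tsv (the filename is an assumption):
result.to_csv('result.tsv', sep='\t')  # writes the 0/1 index as the first column
Note that groupby(..., axis=1) is deprecated in recent pandas releases; if it is unavailable, grouping the transposed frame (df.iloc[:, 1:].T.groupby(...)) and transposing the result back achieves the same column-wise grouping.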