Given a dataframe, how do I calculate the probability of consecutive events using python pandas?
For example,
Time | A | B | C |
---|---|---|---|
1 | 1 | 1 | 1 |
2 | -1 | -1 | -1 |
3 | 1 | 1 | 1 |
4 | -1 | -1 | -1 |
5 | 1 | 1 | 1 |
6 | -1 | -1 | -1 |
7 | 1 | 1 | 1 |
8 | -1 | 1 | 1 |
9 | 1 | -1 | 1 |
10 | -1 | 1 | -1 |
In this dataframe, B has two consecutive "1" at t=7 and t=8, and C has three consecutive "1" from t=7 to t=9.
The probability of two consecutive "1" appearing is 3/27 (3 occurrences out of (10 - 2 + 1) * 3 = 27 possible windows). The probability of three consecutive "1" appearing is 1/24.
How can I do this using python pandas?
CodePudding user response:
Try this code. It also works for other dataframes, e.g. with more columns or rows.
import pandas as pd

def consecutive(num):
    # Sample data from the question; replace it with your own dataframe.
    df = pd.DataFrame({
        'Time' : [i for i in range(1, 11)],
        'A' : [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
        'B' : [1, -1, 1, -1, 1, -1, 1, 1, -1, 1],
        'C' : [1, -1, 1, -1, 1, -1, 1, 1, 1, -1]
    })
    print(df)

    row_num = df.shape[0]
    col_num = df.shape[1]
    cnt = 0 # the number of consecutive runs found
    for col_index in range(1, col_num): # count in each column, skipping 'Time'
        col_tmp = df.iloc[:, col_index]
        consec = 0
        for i in range(row_num):
            if col_tmp.iloc[i] == 1:
                consec += 1
            # if -1 comes after 1, then consec = 0
            else:
                consec = 0
            # after a hit, step consec back by 1 so that overlapping runs are counted too
            if consec == num:
                cnt += 1
                consec -= 1
    all_cases = (row_num - num + 1) * (col_num - 1) # col_num - 1 because of 'Time' column
    prob = cnt / all_cases
    return prob
When you run it on the given dataframe with this code
print(f'two consecutive : {consecutive(2)}')
print(f'three consecutive : {consecutive(3)}')
Output :
   Time  A  B  C
0     1  1  1  1
1     2 -1 -1 -1
2     3  1  1  1
3     4 -1 -1 -1
4     5  1  1  1
5     6 -1 -1 -1
6     7  1  1  1
7     8 -1  1  1
8     9  1 -1  1
9    10 -1  1 -1
two consecutive : 0.1111111111111111
   Time  A  B  C
0     1  1  1  1
1     2 -1 -1 -1
2     3  1  1  1
3     4 -1 -1 -1
4     5  1  1  1
5     6 -1 -1 -1
6     7  1  1  1
7     8 -1  1  1
8     9  1 -1  1
9    10 -1  1 -1
three consecutive : 0.041666666666666664
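The same count can probably be done without explicit loops. Below is a minimal sketch using a rolling window, assuming the values are only 1 and -1 as in the question. The helper name consecutive_rolling and its target parameter are made up for this illustration and are not part of the code above.

```python
import pandas as pd

df = pd.DataFrame({
    'Time': range(1, 11),
    'A': [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
    'B': [1, -1, 1, -1, 1, -1, 1, 1, -1, 1],
    'C': [1, -1, 1, -1, 1, -1, 1, 1, 1, -1],
})

def consecutive_rolling(frame, num, target=1):
    """Hypothetical helper: share of length-`num` windows that contain only `target`."""
    hits = frame.eq(target).astype(int)         # 1 where a cell equals target, else 0
    full = hits.rolling(num).sum().eq(num)      # True where the last `num` cells all matched
    windows = (len(frame) - num + 1) * frame.shape[1]
    return full.to_numpy().sum() / windows

# Pass only the value columns so that 'Time' is not counted.
prob_two = consecutive_rolling(df[['A', 'B', 'C']], 2)
prob_three = consecutive_rolling(df[['A', 'B', 'C']], 3)
print(prob_two, prob_three)
```

The first num - 1 rows of each rolling sum are NaN, so they never compare equal to num and are not counted.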
CodePudding user response:
You can compare rows with previous rows using shift. So, to find out how often two consecutive values are equal, you can do
>>> (df.C == df.C.shift()).sum()
2
To find three consecutive equal values, you'd have to compare the column with itself shifted by 1 (the default) and additionally, shifted by 2.
>>> ((df.C == df.C.shift()) & (df.C == df.C.shift(2))).sum()
1
Another variation of this, using the pd.Series.eq function instead of ==, is:
>>> m = df.C.eq(df.C.shift(1)) & df.C.eq(df.C.shift(2))
>>> m.sum()
1
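To make the shifted comparisons easier to follow, here is a small sketch that puts column C next to its shifted copies. The frame name aligned and its column names are chosen only for this illustration.

```python
import pandas as pd

c = pd.Series([1, -1, 1, -1, 1, -1, 1, 1, 1, -1], name='C')

# Line up each value with the values one and two rows before it.
aligned = pd.DataFrame({
    'C': c,
    'prev_1': c.shift(1),   # NaN in the first row
    'prev_2': c.shift(2),   # NaN in the first two rows
})
# Comparisons against NaN are False, so the leading rows never count as matches.
aligned['same_as_prev_1'] = aligned['C'] == aligned['prev_1']
aligned['same_as_prev_2'] = aligned['C'] == aligned['prev_2']
print(aligned)

pairs = int(aligned['same_as_prev_1'].sum())
triples = int((aligned['same_as_prev_1'] & aligned['same_as_prev_2']).sum())
```

Note that these comparisons test for equal neighbouring values, whatever the value is. They match the question's counts here because column C has no run of -1.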
In this case, since the target value is 1 (and True == 1 is True; it won't work for other target values as is, see below), the pattern can be generalized with functools.reduce to:
from functools import reduce
import pandas as pd
def combos(column, n):
    return reduce(pd.Series.eq, [column.shift(i) for i in range(n)])
You can apply this function to df like so, which will give you the numerator:
>>> df[['A', 'B', 'C']].apply(combos, n = 2).values.sum()
3
>>> df[['A', 'B', 'C']].apply(combos, n = 3).values.sum()
1
To get the denominator, you can do, e.g.,
n = 2
rows, cols = df[['A', 'B', 'C']].shape
denominator = (rows - n + 1) * cols
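Putting the numerator and the denominator together might look like the sketch below. The wrapper name probability is hypothetical; combos is the reduce-based helper from this answer, repeated so that the block runs on its own.

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({
    'Time': range(1, 11),
    'A': [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
    'B': [1, -1, 1, -1, 1, -1, 1, 1, -1, 1],
    'C': [1, -1, 1, -1, 1, -1, 1, 1, 1, -1],
})

def combos(column, n):
    # Chain .eq over the column and its shifted copies (shift(0) is the column itself).
    return reduce(pd.Series.eq, [column.shift(i) for i in range(n)])

def probability(frame, n):
    """Hypothetical wrapper: matches found divided by the number of possible windows."""
    numerator = frame.apply(combos, n=n).values.sum()
    rows, cols = frame.shape
    denominator = (rows - n + 1) * cols
    return numerator / denominator

values = df[['A', 'B', 'C']]    # leave out the 'Time' column
p_two = probability(values, 2)
p_three = probability(values, 3)
print(p_two, p_three)
```

On the question's data this gives 3/27 and 1/24, the values the question expects.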
An idea for a generalized version of the combos function that should work with other target values is
from operator import and_ # equivalent of &
def combos_generalized(col, n):
    return reduce(and_, [col == col.shift(i) for i in range(1, n)])
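As far as I can tell, combos_generalized counts runs of any repeated value, since it only checks that a value equals the ones before it. If you want runs of one specific value, a variant along the following lines might help. This is only a sketch: combos_for_target, its target parameter and the sample series s are made up for this illustration.

```python
from functools import reduce
from operator import and_

import pandas as pd

def combos_generalized(col, n):
    # Helper from the answer: True where a value equals each of the n-1 values before it.
    return reduce(and_, [col == col.shift(i) for i in range(1, n)])

def combos_for_target(col, n, target=1):
    # Hypothetical variant: True where the current value and the n-1 values
    # before it all equal `target` (shift(0) is the column itself).
    return reduce(and_, [col.shift(i) == target for i in range(n)])

s = pd.Series([-1, -1, -1, 1, 1])   # made-up data that contains a run of -1

any_value_pairs = int(combos_generalized(s, 2).sum())            # pairs of -1 and pairs of 1
pairs_of_ones = int(combos_for_target(s, 2, target=1).sum())
pairs_of_minus_ones = int(combos_for_target(s, 2, target=-1).sum())
print(any_value_pairs, pairs_of_ones, pairs_of_minus_ones)
```

On the dataframe from the question both helpers give the same counts, because no column there contains two -1 in a row.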