Calculating probability of consecutive events with python pandas

Given a dataframe, how do I calculate the probability of consecutive events using python pandas?

For example,

Time	A	B	C
1	1	1	1
2	-1	-1	-1
3	1	1	1
4	-1	-1	-1
5	1	1	1
6	-1	-1	-1
7	1	1	1
8	-1	1	1
9	1	-1	1
10	-1	1	-1

In this dataframe, B has two consecutive "1" values at t=7 and t=8, and C has three consecutive "1" values from t=7 to t=9.

The probability that two consecutive "1" values appear is 3/27. The probability that three consecutive "1" values appear is 1/24.
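
One way to read these fractions, assuming the denominator counts every window of n consecutive rows in every column and the numerator counts the windows that contain only 1, is the plain-Python check below. The helper name count_windows is made up for this illustration.

```python
# Columns copied from the example table above.
columns = {
    "A": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
    "B": [1, -1, 1, -1, 1, -1, 1, 1, -1, 1],
    "C": [1, -1, 1, -1, 1, -1, 1, 1, 1, -1],
}


def count_windows(values, n):
    """Return (hits, total): windows of length n that are all 1, and all windows."""
    windows = [values[i:i + n] for i in range(len(values) - n + 1)]
    hits = sum(all(v == 1 for v in w) for w in windows)
    return hits, len(windows)


for n in (2, 3):
    hits = sum(count_windows(col, n)[0] for col in columns.values())
    total = sum(count_windows(col, n)[1] for col in columns.values())
    print(f"n={n}: {hits}/{total}")  # n=2: 3/27, n=3: 1/24
```

Each column of 10 rows has 10 - n + 1 windows, so three columns give 27 windows for n=2 and 24 for n=3.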

How can I do this using python pandas?

CodePudding user response：

Try this code (it also works for other dataframes, i.e. with more columns or rows):

import pandas as pd

def consecutive(num):
    # Example dataframe from the question; swap in your own data here.
    df = pd.DataFrame({
        'Time' : [i for i in range(1, 11)],
        'A' : [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
        'B' : [1, -1, 1, -1, 1, -1, 1, 1, -1, 1],
        'C' : [1, -1, 1, -1, 1, -1, 1, 1, 1, -1]
        })
    print(df)

    row_num = df.shape[0]
    col_num = df.shape[1]

    cnt = 0   # the number of consecutives
    for col_index in range(1, col_num):    # counting for each column
        col_tmp = df.iloc[:, col_index]
        consec = 0
        for i in range(row_num):

            if col_tmp[i] == 1:
                consec += 1

            # if -1 comes after 1, then consec = 0
            else:
                consec = 0

            # consec == num means a window of num consecutive 1s ends at this row;
            # subtract 1 so that an overlapping window can be counted on the next 1
            if consec == num:
                cnt += 1
                consec -= 1

    all_cases = (row_num - num + 1) * (col_num - 1)   # col_num - 1 because of 'Time' column

    prob = cnt / all_cases

    return prob

When you run it on the given dataframe with this code

print(f'two consecutive : {consecutive(2)}')
print(f'three consecutive : {consecutive(3)}')

Output :

   Time  A  B  C
0     1  1  1  1
1     2 -1 -1 -1
2     3  1  1  1
3     4 -1 -1 -1
4     5  1  1  1
5     6 -1 -1 -1
6     7  1  1  1
7     8 -1  1  1
8     9  1 -1  1
9    10 -1  1 -1
two consecutive : 0.1111111111111111
   Time  A  B  C
0     1  1  1  1
1     2 -1 -1 -1
2     3  1  1  1
3     4 -1 -1 -1
4     5  1  1  1
5     6 -1 -1 -1
6     7  1  1  1
7     8 -1  1  1
8     9  1 -1  1
9    10 -1  1 -1
three consecutive : 0.041666666666666664
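
The same count can probably be obtained without explicit loops. The sketch below swaps in a different technique, a rolling window sum, and is not the code from this answer: it marks cells equal to the target, sums each window of n rows, and treats a sum of n as a hit. The function name consecutive_rolling and the target parameter are made up for this illustration, and the frame passed in is assumed to hold only value columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Time": list(range(1, 11)),
    "A": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
    "B": [1, -1, 1, -1, 1, -1, 1, 1, -1, 1],
    "C": [1, -1, 1, -1, 1, -1, 1, 1, 1, -1],
})


def consecutive_rolling(frame, n, target=1):
    # Hypothetical helper: every column of `frame` is treated as a value column.
    # A window of n rows is a hit when all n cells equal `target`.
    hits = frame.eq(target).astype(int).rolling(n).sum().eq(n)
    rows, cols = frame.shape
    return hits.values.sum() / ((rows - n + 1) * cols)


values = df.drop(columns="Time")
print(consecutive_rolling(values, 2))  # 3 hits out of 27 windows
print(consecutive_rolling(values, 3))  # 1 hit out of 24 windows
```

The first n - 1 rows of each rolling sum are NaN, so they never count as hits, which matches the rows - n + 1 windows per column in the denominator.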

CodePudding user response：

You can compare rows with previous rows using shift. So, to find out how often two consecutive values are equal, you can do

>>> (df.C == df.C.shift()).sum()
2
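
To see what this comparison does row by row, here is a small self-contained sketch using column C from the question. Note that comparing a column with its shifted copy is True for any repeated value, so a pair of consecutive -1 values would be counted as well; the both_1 column (a name made up for this illustration) shows one way to restrict the match to 1.

```python
import pandas as pd

c = pd.Series([1, -1, 1, -1, 1, -1, 1, 1, 1, -1], name="C")

view = pd.DataFrame({
    "C": c,
    "shifted": c.shift(),                    # previous row's value; NaN in the first row
    "equal": c == c.shift(),                 # True where a value repeats, whatever it is
    "both_1": (c == 1) & (c.shift() == 1),   # True only for two consecutive 1s
})
print(view)
print(view["equal"].sum(), view["both_1"].sum())  # 2 2
```

For this column both masks give 2, because the only repeated value is 1.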

To find three consecutive equal values, you'd have to compare the column with itself shifted by 1 (the default) and additionally, shifted by 2.

>>> ((df.C == df.C.shift()) & (df.C == df.C.shift(2))).sum()
1

Another variation of this, using the pd.Series.eq method instead of ==, is:

>>> m = df.C.eq(df.C.shift(1)) & df.C.eq(df.C.shift(2))
>>> m.sum()
1

In this case, since the target value is 1 (and True == 1 is True), the pattern can be generalized with functools.reduce. It won't work as is for other target values; see below.

from functools import reduce

import pandas as pd

def combos(column, n):
    # chain .eq across the column shifted by 0, 1, ..., n - 1 rows
    return reduce(pd.Series.eq, [column.shift(i) for i in range(n)])

You can apply this function to df like so, which will give you the numerator:

>>> df[['A', 'B', 'C']].apply(combos, n = 2).values.sum()
3
>>> df[['A', 'B', 'C']].apply(combos, n = 3).values.sum()
1

To get the denominator, you can do, e.g.,

n = 2
rows, cols = df[['A', 'B', 'C']].shape
denominator = (rows - n + 1) * cols
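
Putting the numerator and the denominator together might look like the sketch below. It reuses combos exactly as defined above; the wrapper name probability is made up for this illustration, and the frame passed in is assumed to contain only the value columns.

```python
from functools import reduce

import pandas as pd


def combos(column, n):
    # Same helper as above: chained equality across n shifted copies.
    return reduce(pd.Series.eq, [column.shift(i) for i in range(n)])


def probability(frame, n):
    # Hypothetical wrapper: returns (numerator, denominator, ratio).
    numerator = int(frame.apply(combos, n=n).values.sum())
    rows, cols = frame.shape
    denominator = (rows - n + 1) * cols
    return numerator, denominator, numerator / denominator


df = pd.DataFrame({
    "Time": list(range(1, 11)),
    "A": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
    "B": [1, -1, 1, -1, 1, -1, 1, 1, -1, 1],
    "C": [1, -1, 1, -1, 1, -1, 1, 1, 1, -1],
})
values = df[["A", "B", "C"]]
print(probability(values, 2))  # numerator 3, denominator 27
print(probability(values, 3))  # numerator 1, denominator 24
```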

An idea for a generalized version of the combos function that should work with other target values is

from operator import and_ # equivalent of &

def combos_generalized(col, n):
    return reduce(and_, [col == col.shift(i) for i in range(1, n)])
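
Note that combos_generalized, as written, matches runs of any repeated value rather than one specific value. If only a particular target should count, one possible variant is to compare every shifted copy with that target directly. The name combos_target and the small test series are made up for this illustration.

```python
from functools import reduce
from operator import and_

import pandas as pd


def combos_target(col, n, target=1):
    # Hypothetical variant: True where the current row and the n - 1 rows
    # before it all equal `target`.
    return reduce(and_, [col.shift(i) == target for i in range(n)])


s = pd.Series([-1, -1, -1, 1, 1, 1, -1])
print(combos_target(s, 2).sum())             # 2 windows of two consecutive 1s
print(combos_target(s, 2, target=-1).sum())  # 2 windows of two consecutive -1s
print(combos_target(s, 3).sum())             # 1 window of three consecutive 1s
```

It can be applied column-wise in the same way as combos, e.g. with df[['A', 'B', 'C']].apply(combos_target, n=2).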