I have a large dataframe, around 5000 rows. I need to find out how many times a pattern occurs in a column and record the count in a new column. I am able to use np.where to flag the pattern as 1, but I don't know how to count consecutive occurrences and write the total into the new column. I searched online and tried to use a loop, but I can't figure out how to loop over a dataframe.
df['P'] = np.where((df['val2'] > df['val1']) & (df['val2'] > df['val1'].shift(1)), 1, 0)
Date val1 val2 P [new column]
0 2015-02-24 294 68 0 0
1 2015-02-25 155 31 0 0
2 2015-02-26 168 290 1 1 pattern occurs 1 time
3 2015-02-27 273 108 0 0
4 2015-02-28 55 9 0 0
5 2015-03-01 273 123 0 0
6 2015-03-02 200 46 0 0
7 2015-03-03 80 83 0 0
8 2015-03-04 181 208 1 1 pattern occurs 1 time
9 2015-03-05 195 41 0 0
10 2015-03-06 50 261 1 1 pattern occurs 1 time
11 2015-03-07 50 177 0 0
12 2015-03-08 215 60 1 0
13 2015-03-09 13 290 1 2 pattern occurs 2 times
14 2015-03-10 208 41 0 0
15 2015-03-11 49 263 1 0
16 2015-03-12 171 244 1 0
17 2015-03-13 218 266 1 0
18 2015-03-14 188 219 1 3 pattern occurs 3 times
19 2015-03-15 232 171 0 0
20 2015-03-16 116 196 0 0
21 2015-03-17 262 102 0 0
22 2015-03-18 263 159 0 0
23 2015-03-19 227 160 0 0
24 2015-03-20 103 236 1 0
25 2015-03-21 55 104 1 0
26 2015-03-22 97 109 1 0
27 2015-03-23 38 118 1 4 pattern occurs 4 times
28 2015-03-24 163 116 0 0
29 2015-03-25 256 16 0 0
CodePudding user response:
I'm sure there is a better way to do this, since df.iterrows()
is almost never the right tool, but it works, and with only 5000 rows of data efficiency is not paramount.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': {0: '2015-02-24', 1: '2015-02-25', 2: '2015-02-26', 3: '2015-02-27', 4: '2015-02-28', 5: '2015-03-01', 6: '2015-03-02', 7: '2015-03-03', 8: '2015-03-04', 9: '2015-03-05', 10: '2015-03-06', 11: '2015-03-07', 12: '2015-03-08', 13: '2015-03-09', 14: '2015-03-10', 15: '2015-03-11', 16: '2015-03-12', 17: '2015-03-13', 18: '2015-03-14', 19: '2015-03-15', 20: '2015-03-16', 21: '2015-03-17', 22: '2015-03-18', 23: '2015-03-19', 24: '2015-03-20', 25: '2015-03-21', 26: '2015-03-22', 27: '2015-03-23', 28: '2015-03-24', 29: '2015-03-25'}, 'val1': {0: 294, 1: 155, 2: 168, 3: 273, 4: 55, 5: 273, 6: 200, 7: 80, 8: 181, 9: 195, 10: 50, 11: 50, 12: 215, 13: 13, 14: 208, 15: 49, 16: 171, 17: 218, 18: 188, 19: 232, 20: 116, 21: 262, 22: 263, 23: 227, 24: 103, 25: 55, 26: 97, 27: 38, 28: 163, 29: 256}, 'val2': {0: 68, 1: 31, 2: 290, 3: 108, 4: 9, 5: 123, 6: 46, 7: 83, 8: 208, 9: 41, 10: 261, 11: 177, 12: 60, 13: 290, 14: 41, 15: 263, 16: 244, 17: 266, 18: 219, 19: 171, 20: 196, 21: 102, 22: 159, 23: 160, 24: 236, 25: 104, 26: 109, 27: 118, 28: 116, 29: 16}})
df['P'] = np.where((df['val2'] > df['val1']) & (df['val2'] > df['val1'].shift(1)), 1, 0)
df['new_column'] = 0
counter = 0
for i, row in df.iterrows():
    if row.P == 1:
        counter += 1  # extend the count through a run of consecutive 1s
    else:
        counter = 0   # the run is broken, reset
    df.loc[i, 'new_column'] = counter
# zero out every row except the last of each run (i.e. where the next row's count is 0)
df.new_column = df.new_column * [1 if x == 0 else 0 for x in df.new_column.shift(-1)]
gives
Date val1 val2 P new_column
0 2015-02-24 294 68 0 0
1 2015-02-25 155 31 0 0
2 2015-02-26 168 290 1 1
3 2015-02-27 273 108 0 0
4 2015-02-28 55 9 0 0
5 2015-03-01 273 123 0 0
6 2015-03-02 200 46 0 0
7 2015-03-03 80 83 0 0
8 2015-03-04 181 208 1 1
9 2015-03-05 195 41 0 0
10 2015-03-06 50 261 1 0
11 2015-03-07 50 177 1 2
12 2015-03-08 215 60 0 0
13 2015-03-09 13 290 1 1
14 2015-03-10 208 41 0 0
15 2015-03-11 49 263 1 0
16 2015-03-12 171 244 1 0
17 2015-03-13 218 266 1 0
18 2015-03-14 188 219 1 4
19 2015-03-15 232 171 0 0
20 2015-03-16 116 196 0 0
21 2015-03-17 262 102 0 0
22 2015-03-18 263 159 0 0
23 2015-03-19 227 160 0 0
24 2015-03-20 103 236 1 0
25 2015-03-21 55 104 1 0
26 2015-03-22 97 109 1 0
27 2015-03-23 38 118 1 4
28 2015-03-24 163 116 0 0
29 2015-03-25 256 16 0 0
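The same end-of-run counting can also be done without iterrows at all. Here is a vectorized sketch of the idea, using a toy P column and illustrative names (grp, run_len, is_last are my own, not from the answer above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'P': [0, 1, 0, 1, 1, 1, 0, 1]})

# Label consecutive runs: the id increments every time P changes value.
grp = (df['P'] != df['P'].shift()).cumsum()

# Length of each run of 1s (runs of 0s sum to 0), broadcast back to every row.
run_len = df.groupby(grp)['P'].transform('sum')

# True only on the last row of each run.
is_last = grp != grp.shift(-1)

# Write the run length on the last row of each run of 1s, zero elsewhere.
df['new_column'] = np.where(is_last, run_len, 0)

print(df['new_column'].tolist())  # [0, 1, 0, 0, 0, 3, 0, 1]
```

transform('sum') is what keeps everything aligned to the original index, so no index juggling is needed.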
CodePudding user response:
You can use:
df['new_column'] = (df.P != df.P.shift()).cumsum()  # assign a run id: it increments every time P changes
mask = df.groupby('new_column')['P'].sum()  # sum of P within each run, i.e. the run length
duplicated = df.duplicated('new_column', keep='last')
df.loc[~duplicated, ['new_column']] = np.nan  # set NaN on the last row of each run; the NaNs will be replaced from mask
df['new_column'] = df['new_column'].astype(str).replace(r'\d+', 0, regex=True).replace('nan', np.nan)  # convert the non-NaN entries to zero
# To fill the NaN values from a series, the index values must match, so replace the
# index of mask with the index positions of the NaN rows in df.
mask.index = df[df['new_column'].isnull()].index.to_list()
df['new_column'] = df['new_column'].fillna(mask).astype(int)
df
df
'''
Date val1 val2 P new_column
0 2015-02-24 294 68 0 0
1 2015-02-25 155 31 0 0
2 2015-02-26 168 290 1 1
3 2015-02-27 273 108 0 0
4 2015-02-28 55 9 0 0
5 2015-03-01 273 123 0 0
6 2015-03-02 200 46 0 0
7 2015-03-03 80 83 0 0
8 2015-03-04 181 208 1 1
9 2015-03-05 195 41 0 0
10 2015-03-06 50 261 1 1
11 2015-03-07 50 177 0 0
12 2015-03-08 215 60 1 0
13 2015-03-09 13 290 1 2
14 2015-03-10 208 41 0 0
15 2015-03-11 49 263 1 0
16 2015-03-12 171 244 1 0
17 2015-03-13 218 266 1 0
18 2015-03-14 188 219 1 4
19 2015-03-15 232 171 0 0
20 2015-03-16 116 196 0 0
21 2015-03-17 262 102 0 0
22 2015-03-18 263 159 0 0
23 2015-03-19 227 160 0 0
24 2015-03-20 103 236 1 0
25 2015-03-21 55 104 1 0
26 2015-03-22 97 109 1 0
27 2015-03-23 38 118 1 4
28 2015-03-24 163 116 0 0
29 2015-03-25 256 16 0 0
'''
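The astype(str)/regex detour in the middle steps can be sidestepped: since duplicated(keep='last') leaves exactly one row per run, the run lengths can be assigned onto those rows directly. A hedged sketch of that variant, with a toy P column and illustrative names (grp, run_len, is_last are mine):

```python
import pandas as pd

df = pd.DataFrame({'P': [0, 1, 0, 1, 1, 1, 0, 1]})

# Same run ids as above: the id increments every time P changes.
grp = (df['P'] != df['P'].shift()).cumsum()

# One run length per run id (0 for runs of zeros), in run order.
run_len = df.groupby(grp)['P'].sum()

# Last row of each run = last occurrence of each run id.
is_last = ~grp.duplicated(keep='last')

df['new_column'] = 0
# Exactly one True per run, in order, so the lengths line up positionally.
df.loc[is_last, 'new_column'] = run_len.to_numpy()

print(df['new_column'].tolist())  # [0, 1, 0, 0, 0, 3, 0, 1]
```

This keeps the column numeric throughout, so no string conversion or fillna is needed.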