I have a large dataframe, around 5000 rows. I need to find out how many times a pattern occurs in a column and record the count in a new column. I am able to use np.where to flag the pattern as 1, but I don't know how to count consecutive occurrences and write the total into the new column. I searched online and tried to use a loop, but I can't figure out how to loop over a dataframe.
df['P'] = np.where((df['val2'] > df['val1']) & (df['val2'] > df['val1'].shift(1)), 1, 0)
Date val1 val2 P [new column]
0 2015-02-24 294 68 0 0
1 2015-02-25 155 31 0 0
2 2015-02-26 168 290 1 1 pattern occurs 1 time
3 2015-02-27 273 108 0 0
4 2015-02-28 55 9 0 0
5 2015-03-01 273 123 0 0
6 2015-03-02 200 46 0 0
7 2015-03-03 80 83 0 0
8 2015-03-04 181 208 1 1 pattern occurs 1 time
9 2015-03-05 195 41 0 0
10 2015-03-06 50 261 1 1 pattern occurs 1 time
11 2015-03-07 50 177 0 0
12 2015-03-08 215 60 1 0
13 2015-03-09 13 290 1 2 pattern occurs 2 times
14 2015-03-10 208 41 0 0
15 2015-03-11 49 263 1 0
16 2015-03-12 171 244 1 0
17 2015-03-13 218 266 1 0
18 2015-03-14 188 219 1 3 pattern occurs 3 times
19 2015-03-15 232 171 0 0
20 2015-03-16 116 196 0 0
21 2015-03-17 262 102 0 0
22 2015-03-18 263 159 0 0
23 2015-03-19 227 160 0 0
24 2015-03-20 103 236 1 0
25 2015-03-21 55 104 1 0
26 2015-03-22 97 109 1 0
27 2015-03-23 38 118 1 4 pattern occurs 4 times
28 2015-03-24 163 116 0 0
29 2015-03-25 256 16 0 0
CodePudding user response:
I'm sure there is a better way to do this, since df.iterrows()
is almost never the right tool, but it works, and with only 5000 rows of data efficiency is not paramount.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': {0: '2015-02-24', 1: '2015-02-25', 2: '2015-02-26', 3: '2015-02-27', 4: '2015-02-28', 5: '2015-03-01', 6: '2015-03-02', 7: '2015-03-03', 8: '2015-03-04', 9: '2015-03-05', 10: '2015-03-06', 11: '2015-03-07', 12: '2015-03-08', 13: '2015-03-09', 14: '2015-03-10', 15: '2015-03-11', 16: '2015-03-12', 17: '2015-03-13', 18: '2015-03-14', 19: '2015-03-15', 20: '2015-03-16', 21: '2015-03-17', 22: '2015-03-18', 23: '2015-03-19', 24: '2015-03-20', 25: '2015-03-21', 26: '2015-03-22', 27: '2015-03-23', 28: '2015-03-24', 29: '2015-03-25'}, 'val1': {0: 294, 1: 155, 2: 168, 3: 273, 4: 55, 5: 273, 6: 200, 7: 80, 8: 181, 9: 195, 10: 50, 11: 50, 12: 215, 13: 13, 14: 208, 15: 49, 16: 171, 17: 218, 18: 188, 19: 232, 20: 116, 21: 262, 22: 263, 23: 227, 24: 103, 25: 55, 26: 97, 27: 38, 28: 163, 29: 256}, 'val2': {0: 68, 1: 31, 2: 290, 3: 108, 4: 9, 5: 123, 6: 46, 7: 83, 8: 208, 9: 41, 10: 261, 11: 177, 12: 60, 13: 290, 14: 41, 15: 263, 16: 244, 17: 266, 18: 219, 19: 171, 20: 196, 21: 102, 22: 159, 23: 160, 24: 236, 25: 104, 26: 109, 27: 118, 28: 116, 29: 16}})
df['P'] = np.where((df['val2'] > df['val1']) & (df['val2'] > df['val1'].shift(1)), 1, 0)
df['new_column'] = 0
counter = 0
for i, row in df.iterrows():
    if row.P == 1:
        counter += 1  # extend the count through a run of consecutive 1s
    else:
        counter = 0   # the run is broken, reset
    df.loc[i, 'new_column'] = counter
# zero out every row except the last of each run (i.e. where the next row's count is 0)
df.new_column = df.new_column * [1 if x == 0 else 0 for x in df.new_column.shift(-1)]
gives
Date val1 val2 P new_column
0 2015-02-24 294 68 0 0
1 2015-02-25 155 31 0 0
2 2015-02-26 168 290 1 1
3 2015-02-27 273 108 0 0
4 2015-02-28 55 9 0 0
5 2015-03-01 273 123 0 0
6 2015-03-02 200 46 0 0
7 2015-03-03 80 83 0 0
8 2015-03-04 181 208 1 1
9 2015-03-05 195 41 0 0
10 2015-03-06 50 261 1 0
11 2015-03-07 50 177 1 2
12 2015-03-08 215 60 0 0
13 2015-03-09 13 290 1 1
14 2015-03-10 208 41 0 0
15 2015-03-11 49 263 1 0
16 2015-03-12 171 244 1 0
17 2015-03-13 218 266 1 0
18 2015-03-14 188 219 1 4
19 2015-03-15 232 171 0 0
20 2015-03-16 116 196 0 0
21 2015-03-17 262 102 0 0
22 2015-03-18 263 159 0 0
23 2015-03-19 227 160 0 0
24 2015-03-20 103 236 1 0
25 2015-03-21 55 104 1 0
26 2015-03-22 97 109 1 0
27 2015-03-23 38 118 1 4
28 2015-03-24 163 116 0 0
29 2015-03-25 256 16 0 0
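The same end-of-run counting can also be done without iterrows at all. Here is a vectorized sketch of the idea, using a toy P column and illustrative names (grp, run_len, is_last are my own, not from the answer above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'P': [0, 1, 0, 1, 1, 1, 0, 1]})

# Label consecutive runs: the id increments every time P changes value.
grp = (df['P'] != df['P'].shift()).cumsum()

# Length of each run of 1s (runs of 0s sum to 0), broadcast back to every row.
run_len = df.groupby(grp)['P'].transform('sum')

# True only on the last row of each run.
is_last = grp != grp.shift(-1)

# Write the run length on the last row of each run of 1s, zero elsewhere.
df['new_column'] = np.where(is_last, run_len, 0)

print(df['new_column'].tolist())  # [0, 1, 0, 0, 0, 3, 0, 1]
```

transform('sum') is what keeps everything aligned to the original index, so no index juggling is needed.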
CodePudding user response:
You can use:
df['new_column'] = (df.P != df.P.shift()).cumsum()  # assign a run id: it increments every time P changes
mask = df.groupby('new_column')['P'].sum()  # sum of P within each run, i.e. the run length
duplicated = df.duplicated('new_column', keep='last')
df.loc[~duplicated, ['new_column']] = np.nan  # set NaN on the last row of each run; the NaNs will be replaced from mask
df['new_column'] = df['new_column'].astype(str).replace(r'\d+', 0, regex=True).replace('nan', np.nan)  # convert the non-NaN entries to zero
# To fill the NaN values from a series, the index values must match, so replace the
# index of mask with the index positions of the NaN rows in df.
mask.index = df[df['new_column'].isnull()].index.to_list()
df['new_column'] = df['new_column'].fillna(mask).astype(int)
df
df
'''
Date val1 val2 P new_column
0 2015-02-24 294 68 0 0
1 2015-02-25 155 31 0 0
2 2015-02-26 168 290 1 1
3 2015-02-27 273 108 0 0
4 2015-02-28 55 9 0 0
5 2015-03-01 273 123 0 0
6 2015-03-02 200 46 0 0
7 2015-03-03 80 83 0 0
8 2015-03-04 181 208 1 1
9 2015-03-05 195 41 0 0
10 2015-03-06 50 261 1 1
11 2015-03-07 50 177 0 0
12 2015-03-08 215 60 1 0
13 2015-03-09 13 290 1 2
14 2015-03-10 208 41 0 0
15 2015-03-11 49 263 1 0
16 2015-03-12 171 244 1 0
17 2015-03-13 218 266 1 0
18 2015-03-14 188 219 1 4
19 2015-03-15 232 171 0 0
20 2015-03-16 116 196 0 0
21 2015-03-17 262 102 0 0
22 2015-03-18 263 159 0 0
23 2015-03-19 227 160 0 0
24 2015-03-20 103 236 1 0
25 2015-03-21 55 104 1 0
26 2015-03-22 97 109 1 0
27 2015-03-23 38 118 1 4
28 2015-03-24 163 116 0 0
29 2015-03-25 256 16 0 0
'''
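The astype(str)/regex detour in the middle steps can be sidestepped: since duplicated(keep='last') leaves exactly one row per run, the run lengths can be assigned onto those rows directly. A hedged sketch of that variant, with a toy P column and illustrative names (grp, run_len, is_last are mine):

```python
import pandas as pd

df = pd.DataFrame({'P': [0, 1, 0, 1, 1, 1, 0, 1]})

# Same run ids as above: the id increments every time P changes.
grp = (df['P'] != df['P'].shift()).cumsum()

# One run length per run id (0 for runs of zeros), in run order.
run_len = df.groupby(grp)['P'].sum()

# Last row of each run = last occurrence of each run id.
is_last = ~grp.duplicated(keep='last')

df['new_column'] = 0
# Exactly one True per run, in order, so the lengths line up positionally.
df.loc[is_last, 'new_column'] = run_len.to_numpy()

print(df['new_column'].tolist())  # [0, 1, 0, 0, 0, 3, 0, 1]
```

This keeps the column numeric throughout, so no string conversion or fillna is needed.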