I have a data frame like this:
df = pd.DataFrame({'A': [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1]})
Suppose there are n consecutive ones (in this case n = 8), then a gap of zeros, then another n consecutive ones. I would like to set up a rule: if the gap is at most m zeros long (here m = 4), how can I replace those zeros with 1?
My ideal output would be like this:
df = pd.DataFrame({'A': [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1], 'Fill_Gap': [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,0, 0, 0, 0, 0, 0, 1, 1]})
Only the four zeros at index 13-16 are replaced by 1, because they have 8 consecutive ones both before and after them.
Any advice would be much appreciated!
CodePudding user response:
You can use a regex if you join the column into a string. With a regex you can search for 4 zeros or fewer with 0{,4}, and use a lookbehind and lookahead to require 8 ones on each side with (?<=1{8})...(?=1{8}). I don't think this is an efficient solution.
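import re

df['fill_gap'] = df['A']
col = df.columns.get_loc('fill_gap')
# Join the column into one string so that match offsets equal row positions
text = ''.join(df['A'].astype(str))
for i in re.finditer(r'(?<=1{8})0{,4}(?=1{8})', text):
    # i.span() is the (start, end) range of the matched zeros
    df.iloc[slice(*i.span()), col] = 1
df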
Output
A fill_gap
0 1 1
1 1 1
2 1 1
3 0 0
4 0 0
5 1 1
6 1 1
7 1 1
8 1 1
9 1 1
10 1 1
11 1 1
12 1 1
13 0 1
14 0 1
15 0 1
16 0 1
17 1 1
18 1 1
19 1 1
20 1 1
21 1 1
22 1 1
23 1 1
24 1 1
25 0 0
26 1 1
27 1 1
28 1 1
29 0 0
30 0 0
31 0 0
32 0 0
33 0 0
34 0 0
35 0 0
36 0 0
37 0 0
38 1 1
39 1 1
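Since the question asks for a configurable rule, here is a minimal sketch of the same regex idea with n and m as parameters. The function name fill_gaps_regex and its signature are hypothetical, not part of pandas. It assumes the column holds only 0 and 1, that "n ones" means at least n ones on each side, and it uses 0{1,m} so that empty matches are skipped.

```python
import re

import pandas as pd


def fill_gaps_regex(values: pd.Series, n: int = 8, m: int = 4) -> pd.Series:
    """Return a copy where runs of 1..m zeros flanked by n ones become ones.

    Hypothetical helper: assumes `values` contains only 0 and 1.
    """
    # One character per row, so match offsets equal row positions.
    text = ''.join(values.astype(int).astype(str))
    # For n=8, m=4 this builds (?<=1{8})0{1,4}(?=1{8}).
    pattern = re.compile(rf'(?<=1{{{n}}})0{{1,{m}}}(?=1{{{n}}})')
    filled = values.to_numpy().copy()
    for match in pattern.finditer(text):
        start, end = match.span()
        filled[start:end] = 1
    return pd.Series(filled, index=values.index, name=values.name)


# Same data as in the question, written run by run.
a = [1, 1, 1, 0, 0] + [1] * 8 + [0] * 4 + [1] * 8 + [0, 1, 1, 1] + [0] * 9 + [1, 1]
df = pd.DataFrame({'A': a})
df['fill_gap'] = fill_gaps_regex(df['A'], n=8, m=4)
```

Because the lookbehind and lookahead do not consume characters, one run of ones can serve as the right flank of one gap and the left flank of the next.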
CodePudding user response:
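This works for a series of any length. Note that, as written, it fills the span between the ends of two runs of 8 ones whenever that span contains at least 4 consecutive zeros: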
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,
                         0, 0, 0, 0, 0, 0, 1, 1]})
# Flag rows that end a run of at least 8 ones
lst1 = (df.A == 1)
for x in range(1, 8):
    lst1 = lst1 & (df.A.shift(periods=x) == 1)
# Flag rows that end a run of at least 4 zeros
lst0 = (df.A == 0)
for x in range(1, 4):
    lst0 = lst0 & (df.A.shift(periods=x) == 0)
# Row positions where a run of 8 ones ends
ones = np.flatnonzero(lst1.to_numpy())
# Fill gaps: mark everything between two such positions if a run of zeros lies between them
for x in range(1, len(ones)):
    if lst0.iloc[ones[x-1]:ones[x]].any():
        lst1.iloc[ones[x-1]:ones[x]] = True
# Apply to data frame
df.loc[lst1, 'A'] = 1
Output:
A
0 1
1 1
2 1
3 0
4 0
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 1
19 1
20 1
21 1
22 1
23 1
24 1
25 0
26 1
27 1
28 1
29 0
30 0
31 0
32 0
33 0
34 0
35 0
36 0
37 0
38 1
39 1
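As an alternative, here is a minimal sketch that states the rule from the question directly in terms of run lengths: a run of zeros is filled only if it is at most m long and both neighbouring runs of ones are at least n long. The function name fill_gaps_runs is hypothetical. It assumes the column holds only 0 and 1, and it treats "n ones" as "at least n"; change ge to eq if exactly n is required.

```python
import pandas as pd


def fill_gaps_runs(values: pd.Series, n: int = 8, m: int = 4) -> pd.Series:
    """Fill zero-runs of length <= m that sit between one-runs of length >= n.

    Hypothetical helper: assumes `values` contains only 0 and 1.
    """
    # Give every run of equal consecutive values its own id (1, 2, 3, ...).
    run_id = values.ne(values.shift()).cumsum()
    grouped = values.groupby(run_id)
    runs = pd.DataFrame({'value': grouped.first(), 'length': grouped.size()})
    # Lengths of the neighbouring runs (NaN at either end of the series).
    prev_len = runs['length'].shift(1)
    next_len = runs['length'].shift(-1)
    fillable = (
        runs['value'].eq(0)
        & runs['length'].le(m)
        & prev_len.ge(n)
        & next_len.ge(n)
    )
    # Map the per-run decision back onto every row of that run.
    mask = run_id.map(fillable)
    return values.mask(mask, 1)


# Same data as in the question, written run by run.
a = [1, 1, 1, 0, 0] + [1] * 8 + [0] * 4 + [1] * 8 + [0, 1, 1, 1] + [0] * 9 + [1, 1]
df = pd.DataFrame({'A': a})
df['Fill_Gap'] = fill_gaps_runs(df['A'], n=8, m=4)
```

With binary data the runs alternate between zeros and ones, so the runs before and after a zero-run are always runs of ones. That is why comparing the neighbours' lengths is enough.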