Removing specific records based on sequences of value in one column-CodePudding

According to case: How to remove specific records based on column pattern

Right now I have another issue. Example of df table:

event	value	time
seed	57	2021-08-01 09:49:23
ghy	869	2021-08-02 09:50:12
repo	5324	2021-09-03 10:49:23
repo	null	2021-09-03 11:49:23
harv	12	2021-09-05 09:43:23
weig	5,37,12	2021-09-06 09:25:12
repo	null,null,4,8	2021-09-07 09:12:23
repo	4,8,null,null	2021-09-07 10:49:23
repo	null,null,4,8	2021-09-08 17:49:23
repo	4,8,1,3	2021-09-09 12:12:23
repo	1356	2021-09-10 12:49:23

I would like to remove records after specific value, contains pattern: null,null,x,y, but only when I can find sequence: null,null,x,y , x,y,null,null and again null,null,x,y.

x, y are any natural numbers. Column value is a string.

E.g. for the value sequence:

null,null,1,3 1,3,null,null null,null,1,3 - the last two values (records) should be deleted.

Based on this above, expected output should looks like this:

event	value	time
seed	57	2021-08-01 09:49:23
ghy	869	2021-08-02 09:50:12
repo	5324	2021-09-03 10:49:23
repo	null	2021-09-03 11:49:23
harv	12	2021-09-05 09:43:23
weig	5,37,12	2021-09-06 09:25:12
repo	null,null,4,8	2021-09-07 09:12:23
repo	4,8,1,3	2021-09-09 12:12:23
repo	1356	2021-09-10 12:49:23

However, if I have for example table like this one:

event	value	time
seed	57	2021-08-01 09:49:23
ghy	869	2021-08-02 09:50:12
repo	5324	2021-09-03 10:49:23
repo	null	2021-09-03 11:49:23
harv	12	2021-09-05 09:43:23
weig	5,37,12	2021-09-06 09:25:12
repo	null,null,4,8	2021-09-07 09:12:23
repo	4,8,null,null	2021-09-07 10:49:23
repo	4,8,1,3	2021-09-09 12:12:23
repo	1356	2021-09-10 12:49:23

There is no modification needed (because the sequence of three values is not preserved).

I'm new to Python and stuck on these table modifications. I will be grateful for any idea.

CodePudding user response：

I first thought of DataFrame.rolling; however, it doesn't accept string column. So I made use of DataFrame.shift twice to group the previous two values together:

#DataFrame initialization
data = {
'event': ['seed', 'ghy', 'repo', 'repo', 'harv', 'weig', 'repo', 'repo', 'repo', 'repo', 'repo'],
'value' :['57', '869', '5324', 'null', '12', '5,37,12', 'null,null,4,8', '4,8,null,null', 'null,null,4,8', '4,8,1,3', '1356'],
'time': ['1/8/2021 9:49', '2/8/2021 9:50', '3/9/2021 10:49', '3/9/2021 11:49', '5/9/2021 9:43', '6/9/2021 9:25', '7/9/2021 9:12', '7/9/2021 10:49', '8/9/2021 17:49', '9/9/2021 12:12', '10/9/2021 12:49']
}

df = pd.DataFrame(data)
df['time'] = pd.to_datetime(df['time'])


#Processing
patterns = [r'null,null,\d ,\d ',
            r'\d ,\d ,null,null',
            r'null,null,\d ,\d ',]

screening = 0
for pos, pat in enumerate(patterns):
  screening  = df['value'].shift(pos).str.contains(pat)
screening = (screening == 3)
df = df[~(screening | screening.shift(-1))]
print(df)