Home > Mobile >  Removing specific records based on sequences of value in one column
Removing specific records based on sequences of value in one column

Time:10-05

According to case: How to remove specific records based on column pattern

Right now I have another issue. Example of df table:

event value time
seed 57 2021-08-01 09:49:23
ghy 869 2021-08-02 09:50:12
repo 5324 2021-09-03 10:49:23
repo null 2021-09-03 11:49:23
harv 12 2021-09-05 09:43:23
weig 5,37,12 2021-09-06 09:25:12
repo null,null,4,8 2021-09-07 09:12:23
repo 4,8,null,null 2021-09-07 10:49:23
repo null,null,4,8 2021-09-08 17:49:23
repo 4,8,1,3 2021-09-09 12:12:23
repo 1356 2021-09-10 12:49:23

I would like to remove records after specific value, contains pattern: null,null,x,y, but only when I can find sequence: null,null,x,y , x,y,null,null and again null,null,x,y.

x, y are any natural numbers. Column value is a string.

E.g. for the value sequence:

null,null,1,3 1,3,null,null null,null,1,3 - the last two values ​​(records) should be deleted.

Based on this above, expected output should looks like this:

event value time
seed 57 2021-08-01 09:49:23
ghy 869 2021-08-02 09:50:12
repo 5324 2021-09-03 10:49:23
repo null 2021-09-03 11:49:23
harv 12 2021-09-05 09:43:23
weig 5,37,12 2021-09-06 09:25:12
repo null,null,4,8 2021-09-07 09:12:23
repo 4,8,1,3 2021-09-09 12:12:23
repo 1356 2021-09-10 12:49:23

However, if I have for example table like this one:

event value time
seed 57 2021-08-01 09:49:23
ghy 869 2021-08-02 09:50:12
repo 5324 2021-09-03 10:49:23
repo null 2021-09-03 11:49:23
harv 12 2021-09-05 09:43:23
weig 5,37,12 2021-09-06 09:25:12
repo null,null,4,8 2021-09-07 09:12:23
repo 4,8,null,null 2021-09-07 10:49:23
repo 4,8,1,3 2021-09-09 12:12:23
repo 1356 2021-09-10 12:49:23

There is no modification needed (because the sequence of three values ​​is not preserved).

I'm new to Python and stuck on these table modifications. I will be grateful for any idea.

CodePudding user response:

I first thought of DataFrame.rolling; however, it doesn't accept string column. So I made use of DataFrame.shift twice to group the previous two values together:

#DataFrame initialization
data = {
'event': ['seed', 'ghy', 'repo', 'repo', 'harv', 'weig', 'repo', 'repo', 'repo', 'repo', 'repo'],
'value' :['57', '869', '5324', 'null', '12', '5,37,12', 'null,null,4,8', '4,8,null,null', 'null,null,4,8', '4,8,1,3', '1356'],
'time': ['1/8/2021 9:49', '2/8/2021 9:50', '3/9/2021 10:49', '3/9/2021 11:49', '5/9/2021 9:43', '6/9/2021 9:25', '7/9/2021 9:12', '7/9/2021 10:49', '8/9/2021 17:49', '9/9/2021 12:12', '10/9/2021 12:49']
}

df = pd.DataFrame(data)
df['time'] = pd.to_datetime(df['time'])


#Processing
patterns = [r'null,null,\d ,\d ',
            r'\d ,\d ,null,null',
            r'null,null,\d ,\d ',]

screening = 0
for pos, pat in enumerate(patterns):
  screening  = df['value'].shift(pos).str.contains(pat)
screening = (screening == 3)
df = df[~(screening | screening.shift(-1))]
print(df)
  • Related