According to case: How to remove specific records based on column pattern
Right now I have another issue.
Example of df
table:
event | value | time |
---|---|---|
seed | 57 | 2021-08-01 09:49:23 |
ghy | 869 | 2021-08-02 09:50:12 |
repo | 5324 | 2021-09-03 10:49:23 |
repo | null | 2021-09-03 11:49:23 |
harv | 12 | 2021-09-05 09:43:23 |
weig | 5,37,12 | 2021-09-06 09:25:12 |
repo | null,null,4,8 | 2021-09-07 09:12:23 |
repo | 4,8,null,null | 2021-09-07 10:49:23 |
repo | null,null,4,8 | 2021-09-08 17:49:23 |
repo | 4,8,1,3 | 2021-09-09 12:12:23 |
repo | 1356 | 2021-09-10 12:49:23 |
I would like to remove records after specific value, contains pattern: null,null,x,y
, but only when I can find sequence: null,null,x,y
, x,y,null,null
and again null,null,x,y
.
x, y
are any natural numbers. Column value
is a string.
E.g. for the value
sequence:
null,null,1,3
1,3,null,null
null,null,1,3
- the last two values (records) should be deleted.
Based on this above, expected output should looks like this:
event | value | time |
---|---|---|
seed | 57 | 2021-08-01 09:49:23 |
ghy | 869 | 2021-08-02 09:50:12 |
repo | 5324 | 2021-09-03 10:49:23 |
repo | null | 2021-09-03 11:49:23 |
harv | 12 | 2021-09-05 09:43:23 |
weig | 5,37,12 | 2021-09-06 09:25:12 |
repo | null,null,4,8 | 2021-09-07 09:12:23 |
repo | 4,8,1,3 | 2021-09-09 12:12:23 |
repo | 1356 | 2021-09-10 12:49:23 |
However, if I have for example table like this one:
event | value | time |
---|---|---|
seed | 57 | 2021-08-01 09:49:23 |
ghy | 869 | 2021-08-02 09:50:12 |
repo | 5324 | 2021-09-03 10:49:23 |
repo | null | 2021-09-03 11:49:23 |
harv | 12 | 2021-09-05 09:43:23 |
weig | 5,37,12 | 2021-09-06 09:25:12 |
repo | null,null,4,8 | 2021-09-07 09:12:23 |
repo | 4,8,null,null | 2021-09-07 10:49:23 |
repo | 4,8,1,3 | 2021-09-09 12:12:23 |
repo | 1356 | 2021-09-10 12:49:23 |
There is no modification needed (because the sequence of three values is not preserved).
I'm new to Python and stuck on these table modifications. I will be grateful for any idea.
CodePudding user response:
I first thought of DataFrame.rolling
; however, it doesn't accept string column. So I made use of DataFrame.shift
twice to group the previous two values together:
#DataFrame initialization
data = {
'event': ['seed', 'ghy', 'repo', 'repo', 'harv', 'weig', 'repo', 'repo', 'repo', 'repo', 'repo'],
'value' :['57', '869', '5324', 'null', '12', '5,37,12', 'null,null,4,8', '4,8,null,null', 'null,null,4,8', '4,8,1,3', '1356'],
'time': ['1/8/2021 9:49', '2/8/2021 9:50', '3/9/2021 10:49', '3/9/2021 11:49', '5/9/2021 9:43', '6/9/2021 9:25', '7/9/2021 9:12', '7/9/2021 10:49', '8/9/2021 17:49', '9/9/2021 12:12', '10/9/2021 12:49']
}
df = pd.DataFrame(data)
df['time'] = pd.to_datetime(df['time'])
#Processing
patterns = [r'null,null,\d ,\d ',
r'\d ,\d ,null,null',
r'null,null,\d ,\d ',]
screening = 0
for pos, pat in enumerate(patterns):
screening = df['value'].shift(pos).str.contains(pat)
screening = (screening == 3)
df = df[~(screening | screening.shift(-1))]
print(df)