How to remove specific records based on column pattern-CodePudding

I have a table like this:

event	value	time
seed	57	2021-08-01 09:49:23
ghy	869	2021-08-02 09:50:12
repo	5324	2021-09-03 10:49:23
repo	null	2021-09-03 11:49:23
harv	12	2021-09-05 09:43:23
weig	5,37,12	2021-09-06 09:25:12
repo	null,null,4,8	2021-09-07 09:12:23
repo	4,8,null,null	2021-09-07 10:49:23
repo	null,null,4,8	2021-09-08 17:49:23
repo	4,8,1,3	2021-09-09 12:12:23
repo	1356	2021-09-10 12:49:23

Sometimes the value column has the following pattern: null, null, x, y, where x and y are any natural numbers.

Do you know how to delete all pairs of records from the diagram: x, y, null, null and then again null, null, x, y immediately after the first occurrence of such a pattern?

I mean the expected output should be:

event	value	time
seed	57	2021-08-01 09:49:23
ghy	869	2021-08-02 09:50:12
repo	5324	2021-09-03 10:49:23
repo	null	2021-09-03 11:49:23
harv	12	2021-09-05 09:43:23
weig	5,37,12	2021-09-06 09:25:12
repo	null,null,4,8	2021-09-07 09:12:23
repo	4,8,1,3	2021-09-09 12:12:23
repo	1356	2021-09-10 12:49:23

When according to one of the answers, I use it:

import numpy as np

df['value'] = df['value'].apply(lambda x : ','.join(np.sort(x.split(','))))

df.drop_duplicates(['value'], keep='first')

I get:

event	value	time
seed	57	2021-08-01 09:49:23
ghy	869	2021-08-02 09:50:12
repo	5324	2021-09-03 10:49:23
repo	null	2021-09-03 11:49:23
harv	12	2021-09-05 09:43:23
weig	12,37,5	2021-09-06 09:25:12
repo	4,8,null,null	2021-09-07 09:12:23
repo	4,8,1,3	2021-09-09 12:12:23
repo	1356	2021-09-10 12:49:23

Some of the values in the 'value' column change their positions (see bold).

Do you have an idea how to fix it?

CodePudding user response：

Because element in value column is string. you can .split() them and sort them with np.sort then back them to string and use drop_duplicates() like below.

Try this:

import numpy as np

df['value2'] = df['value'].apply(lambda x : ','.join(np.sort(x.split(','))))

df.drop_duplicates(['value2'], keep='first')