Home > Mobile >  How to remove specific records based on column pattern
How to remove specific records based on column pattern

Time:10-05

I have a table like this:

event value time
seed 57 2021-08-01 09:49:23
ghy 869 2021-08-02 09:50:12
repo 5324 2021-09-03 10:49:23
repo null 2021-09-03 11:49:23
harv 12 2021-09-05 09:43:23
weig 5,37,12 2021-09-06 09:25:12
repo null,null,4,8 2021-09-07 09:12:23
repo 4,8,null,null 2021-09-07 10:49:23
repo null,null,4,8 2021-09-08 17:49:23
repo 4,8,1,3 2021-09-09 12:12:23
repo 1356 2021-09-10 12:49:23

Sometimes the value column has the following pattern: null, null, x, y, where x and y are any natural numbers.

Do you know how to delete all pairs of records from the diagram: x, y, null, null and then again null, null, x, y immediately after the first occurrence of such a pattern?

I mean the expected output should be:

event value time
seed 57 2021-08-01 09:49:23
ghy 869 2021-08-02 09:50:12
repo 5324 2021-09-03 10:49:23
repo null 2021-09-03 11:49:23
harv 12 2021-09-05 09:43:23
weig 5,37,12 2021-09-06 09:25:12
repo null,null,4,8 2021-09-07 09:12:23
repo 4,8,1,3 2021-09-09 12:12:23
repo 1356 2021-09-10 12:49:23

When according to one of the answers, I use it:

import numpy as np

df['value'] = df['value'].apply(lambda x : ','.join(np.sort(x.split(','))))

df.drop_duplicates(['value'], keep='first')

I get:

event value time
seed 57 2021-08-01 09:49:23
ghy 869 2021-08-02 09:50:12
repo 5324 2021-09-03 10:49:23
repo null 2021-09-03 11:49:23
harv 12 2021-09-05 09:43:23
weig 12,37,5 2021-09-06 09:25:12
repo 4,8,null,null 2021-09-07 09:12:23
repo 4,8,1,3 2021-09-09 12:12:23
repo 1356 2021-09-10 12:49:23

Some of the values ​​in the 'value' column change their positions (see bold).

Do you have an idea how to fix it?

CodePudding user response:

Because element in value column is string. you can .split() them and sort them with np.sort then back them to string and use drop_duplicates() like below.

Try this:

import numpy as np

df['value2'] = df['value'].apply(lambda x : ','.join(np.sort(x.split(','))))

df.drop_duplicates(['value2'], keep='first')
  • Related