I have a table like this:
event | value | time |
---|---|---|
seed | 57 | 2021-08-01 09:49:23 |
ghy | 869 | 2021-08-02 09:50:12 |
repo | 5324 | 2021-09-03 10:49:23 |
repo | null | 2021-09-03 11:49:23 |
harv | 12 | 2021-09-05 09:43:23 |
weig | 5,37,12 | 2021-09-06 09:25:12 |
repo | null,null,4,8 | 2021-09-07 09:12:23 |
repo | 4,8,null,null | 2021-09-07 10:49:23 |
repo | null,null,4,8 | 2021-09-08 17:49:23 |
repo | 4,8,1,3 | 2021-09-09 12:12:23 |
repo | 1356 | 2021-09-10 12:49:23 |
Sometimes the value
column has the following pattern: null, null, x, y
, where x and y are any natural numbers.
Do you know how to delete all pairs of records from the diagram: x, y, null, null
and then again null, null, x, y
immediately after the first occurrence of such a pattern?
I mean the expected output should be:
event | value | time |
---|---|---|
seed | 57 | 2021-08-01 09:49:23 |
ghy | 869 | 2021-08-02 09:50:12 |
repo | 5324 | 2021-09-03 10:49:23 |
repo | null | 2021-09-03 11:49:23 |
harv | 12 | 2021-09-05 09:43:23 |
weig | 5,37,12 | 2021-09-06 09:25:12 |
repo | null,null,4,8 | 2021-09-07 09:12:23 |
repo | 4,8,1,3 | 2021-09-09 12:12:23 |
repo | 1356 | 2021-09-10 12:49:23 |
When according to one of the answers, I use it:
import numpy as np
df['value'] = df['value'].apply(lambda x : ','.join(np.sort(x.split(','))))
df.drop_duplicates(['value'], keep='first')
I get:
event | value | time |
---|---|---|
seed | 57 | 2021-08-01 09:49:23 |
ghy | 869 | 2021-08-02 09:50:12 |
repo | 5324 | 2021-09-03 10:49:23 |
repo | null | 2021-09-03 11:49:23 |
harv | 12 | 2021-09-05 09:43:23 |
weig | 12,37,5 | 2021-09-06 09:25:12 |
repo | 4,8,null,null | 2021-09-07 09:12:23 |
repo | 4,8,1,3 | 2021-09-09 12:12:23 |
repo | 1356 | 2021-09-10 12:49:23 |
Some of the values in the 'value' column change their positions (see bold).
Do you have an idea how to fix it?
CodePudding user response:
Because element in value column
is string. you can .split()
them and sort them with np.sort
then back them to string and use drop_duplicates()
like below.
Try this:
import numpy as np
df['value2'] = df['value'].apply(lambda x : ','.join(np.sort(x.split(','))))
df.drop_duplicates(['value2'], keep='first')