Remove duplicates in Python


I have a data frame like this (shown in the image). I need to remove the duplicates from inside the lists in the data frame. Can you help me?

My output:

CodePudding user response:

Just building on the answer of @Sachin Kohli and the astute observation of @Ignatius Reilly: you can use a regex to extract the datetimes and then apply the same set-based process.

Here's an example dataframe similar to yours above:

import pandas as pd

df = pd.DataFrame({'Date modified': ['[18/3/2022 9:35:54, 18/3/2022 9:35:54]',
                                     '[18/3/2022 9:35:54, 18/3/2022 9:35:54, 18/3/2022 9:35:55]']})

And the following code will extract the dates and remove duplicates.

import re

# Match datetimes of the form d/m/yyyy h:mm:ss
date_pattern = r'\d+/\d+/\d+ \d+:\d+:\d+'
df["Date modified1"] = df["Date modified"].apply(lambda x: list(set(re.findall(date_pattern, x))))

I've added a new column, Date modified1, but you could of course just overwrite the original column instead.
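For instance, a minimal sketch of overwriting the original column instead, reusing date_pattern from above:

df["Date modified"] = df["Date modified"].apply(lambda x: list(set(re.findall(date_pattern, x))))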

If ordering is important in the lists, then it's probably easiest to use a helper function. The following adds an intermediate step of converting to datetime objects so that the values sort chronologically rather than alphabetically.

import re
import datetime

def helper(s):
    # Match datetimes of the form d/m/yyyy h:mm:ss
    re_date_pattern = r'\d+/\d+/\d+ \d+:\d+:\d+'
    dt_date_pattern = '%d/%m/%Y %H:%M:%S'
    date_strings = re.findall(re_date_pattern, s)
    # Deduplicate, parse to datetime objects and sort chronologically
    date_times = sorted(set(datetime.datetime.strptime(x, dt_date_pattern) for x in date_strings))
    # Format back to strings (note that strftime zero-pads, e.g. 18/03/2022 09:35:54)
    return [datetime.datetime.strftime(x, dt_date_pattern) for x in date_times]

And this can be used in the same way as the example above, but without the need for a lambda function:

df["Date modified1"] = df["Date modified"].apply(helper)

CodePudding user response:

This is what @bn_ln is referring to... doable using apply with a lambda function:

df["Date modified1"] = df["Date modified"].apply(lambda x: list(set(x)))