How to remove duplicates lists in a column-CodePudding

I'm a beginner at coding & I'm stuck! :(

I have 2 columns: Purpose and Frequency for 10 000 of data with a list of items in each row under "Purpose"

Purpose	Frequency
Bathe, Eat, Sleep	5
Sleep, Eat, Bathe	4
Talk, Eat, Sleep, Movie	10
Sleep, Talk, Movie, Eat	8
Sleep, Eat, Talk, Movie	2

My desired outcome is to have only one row with that combination of elements - where rows with the same combination will be removed from the column, regardless of the order of elements.

Purpose	Frequency
Sleep, Talk, Movie, Eat	20
Bathe, Eat, Sleep	9

If possible, Could you also advise how to add up the frequency and show it in descending order. I'm currently using pandas to code. Your help will be greatly appreciated!!!

Tried drop_duplicates to no avail

CodePudding user response：

You can aggregate sum with frozensets:

s = df['Purpose'].apply(lambda x: frozenset(x.split(', ')))
df.groupby(s)['Frequency'].sum().reset_index()

Or by sorted splitted values:

s = df['Purpose'].apply(lambda x: ', '.join(sorted(x.split(', '))))
df = df.groupby(s)['Frequency'].sum().reset_index()
print (df)
                   Purpose  Frequency
0        Bathe, Eat, Sleep          9
1  Eat, Movie, Sleep, Talk         20

Or split in list comprehension:

L = [', '.join(sorted(x.split(', '))) for x in df['Purpose']]
df = df.groupby(L)['Frequency'].sum().rename_axis('Purpose').reset_index()

Last sorting by column Frequency:

df = df.sort_values('Frequency', ascending=False, ignore_index=True)
print (df)
                   Purpose  Frequency
0  Eat, Movie, Sleep, Talk         20
1        Bathe, Eat, Sleep          9

CodePudding user response：

Here's another way you could achieve your desired output:


# Necessary imports
import pandas as pd

# == Create Example DataFrame ==================================================
# Create a sample dataframe to test the implementation.
# This dataframe was taken from the example provided
# as markdown.
df = pd.DataFrame(
    [
        ['Bathe, Eat, Sleep', 5],
        ['Sleep, Eat, Bathe', 4],
        ['Talk, Eat, Sleep, Movie', 10],
        ['Sleep, Talk, Movie, Eat', 8],
        ['Sleep, Eat, Talk, Movie', 2],
    ], columns = ['Purpose', 'Frequency']
)

# == Preprocess `Purpose` Column ===============================================
# Here's what the code below is doing:
# 0. For each value of the `Purpose` column, we replace any existing
#    ', ' into ','. This is to avoid any string represented as something like
#    'Bathe,Eat, Sleep' from being split on the next step
#         ^-^
#         |
#          --- not separated by ', '.
# 1. We split each value from `Purpose` column into different elements.
#    In other words, we transform 'Bathe, Eat, Sleep' into ['Bathe', 'Eat', 'Sleep'].
# 2. Then we sort the elements from each of these newly created lists
# 3. Finally, we join the elements back into a single string, separated by ', '.
df['Purpose'] = (
    df['Purpose']
    # Step 0
    .str.replace(', ', ',')
    # Step 1
    .str.split(',')
    # Step 2
    .apply(sorted)
    # Step 3
    .apply(lambda value: ', '.join(value))
)
print(df)
# Prints:
#                    Purpose  Frequency
# 0        Bathe, Eat, Sleep          5
# 1        Bathe, Eat, Sleep          4
# 2  Eat, Movie, Sleep, Talk         10
# 3  Eat, Movie, Sleep, Talk          8
# 4  Eat, Movie, Sleep, Talk          2

# == Group Frequencies that have the same Purpose together =====================
# Now that all elements from each value of the `Purpose` column are
# correctly ordered, we can use the `pandas.DataFrame.groupby` method
# to group `Frequency` values that have the same `Purpose`.
grouped_df = (
    df
    .groupby('Purpose', as_index=False)['Frequency'].sum()
    # The last step is to call the `sort_values` method, to
    # sort each summed Frequency in descending order.
    .sort_values('Frequency', ascending=False)
)
grouped_df
# Returns:
#            Ordered Purpose  Frequency
# 1  Eat, Movie, Sleep, Talk         20
# 0        Bathe, Eat, Sleep          9