I'm a beginner at coding & I'm stuck! :(
I have 2 columns: Purpose and Frequency for 10 000 of data with a list of items in each row under "Purpose"
Purpose | Frequency |
---|---|
Bathe, Eat, Sleep | 5 |
Sleep, Eat, Bathe | 4 |
Talk, Eat, Sleep, Movie | 10 |
Sleep, Talk, Movie, Eat | 8 |
Sleep, Eat, Talk, Movie | 2 |
My desired outcome is to have only one row with that combination of elements - where rows with the same combination will be removed from the column, regardless of the order of elements.
Purpose | Frequency |
---|---|
Sleep, Talk, Movie, Eat | 20 |
Bathe, Eat, Sleep | 9 |
If possible, Could you also advise how to add up the frequency and show it in descending order. I'm currently using pandas to code. Your help will be greatly appreciated!!!
Tried drop_duplicates to no avail
CodePudding user response:
You can aggregate sum
with frozenset
s:
s = df['Purpose'].apply(lambda x: frozenset(x.split(', ')))
df.groupby(s)['Frequency'].sum().reset_index()
Or by sorted splitted values:
s = df['Purpose'].apply(lambda x: ', '.join(sorted(x.split(', '))))
df = df.groupby(s)['Frequency'].sum().reset_index()
print (df)
Purpose Frequency
0 Bathe, Eat, Sleep 9
1 Eat, Movie, Sleep, Talk 20
Or split in list comprehension:
L = [', '.join(sorted(x.split(', '))) for x in df['Purpose']]
df = df.groupby(L)['Frequency'].sum().rename_axis('Purpose').reset_index()
Last sorting by column Frequency
:
df = df.sort_values('Frequency', ascending=False, ignore_index=True)
print (df)
Purpose Frequency
0 Eat, Movie, Sleep, Talk 20
1 Bathe, Eat, Sleep 9
CodePudding user response:
Here's another way you could achieve your desired output:
# Necessary imports
import pandas as pd
# == Create Example DataFrame ==================================================
# Create a sample dataframe to test the implementation.
# This dataframe was taken from the example provided
# as markdown.
df = pd.DataFrame(
[
['Bathe, Eat, Sleep', 5],
['Sleep, Eat, Bathe', 4],
['Talk, Eat, Sleep, Movie', 10],
['Sleep, Talk, Movie, Eat', 8],
['Sleep, Eat, Talk, Movie', 2],
], columns = ['Purpose', 'Frequency']
)
# == Preprocess `Purpose` Column ===============================================
# Here's what the code below is doing:
# 0. For each value of the `Purpose` column, we replace any existing
# ', ' into ','. This is to avoid any string represented as something like
# 'Bathe,Eat, Sleep' from being split on the next step
# ^-^
# |
# --- not separated by ', '.
# 1. We split each value from `Purpose` column into different elements.
# In other words, we transform 'Bathe, Eat, Sleep' into ['Bathe', 'Eat', 'Sleep'].
# 2. Then we sort the elements from each of these newly created lists
# 3. Finally, we join the elements back into a single string, separated by ', '.
df['Purpose'] = (
df['Purpose']
# Step 0
.str.replace(', ', ',')
# Step 1
.str.split(',')
# Step 2
.apply(sorted)
# Step 3
.apply(lambda value: ', '.join(value))
)
print(df)
# Prints:
# Purpose Frequency
# 0 Bathe, Eat, Sleep 5
# 1 Bathe, Eat, Sleep 4
# 2 Eat, Movie, Sleep, Talk 10
# 3 Eat, Movie, Sleep, Talk 8
# 4 Eat, Movie, Sleep, Talk 2
# == Group Frequencies that have the same Purpose together =====================
# Now that all elements from each value of the `Purpose` column are
# correctly ordered, we can use the `pandas.DataFrame.groupby` method
# to group `Frequency` values that have the same `Purpose`.
grouped_df = (
df
.groupby('Purpose', as_index=False)['Frequency'].sum()
# The last step is to call the `sort_values` method, to
# sort each summed Frequency in descending order.
.sort_values('Frequency', ascending=False)
)
grouped_df
# Returns:
# Ordered Purpose Frequency
# 1 Eat, Movie, Sleep, Talk 20
# 0 Bathe, Eat, Sleep 9