I have a python data frame with values in a string separated by delimiter and a corresponding column with numeric value. I want to combine the rows with same list of unique elements in the string irrespective of their sequence and aggregate the corresponding column to that row
The data frame appears like this:
df = pd.DataFrame([['Organic Search, Direct', 171], ['Organic Search, Direct, Not Found', 3], ['Direct, Organic Search', 389], ['Direct, Organic Search, Not Found', 8]], columns=['A', 'B'])
df
And I want the result to appear like this:
df = pd.DataFrame([['Direct, Organic Search',560], ['Direct, Organic Search, Not Found',11]], columns = ['A','B'])
Initially I thought of using a for loop to traverse through each row of the data frame and store the list of elements in an array alphabetically, then combining the rows based on that processed row.
However, I couldn't formulate this into a code.
CodePudding user response:
You can easily do this. Example:
import pandas as pd
df = pd.DataFrame([['Organic Search, Direct', 171], ['Organic Search, Direct, Not Found', 3], ['Direct, Organic Search', 389], ['Direct, Organic Search, Not Found', 8]], columns=['A', 'B'])
print(df)
Out:
A B
0 Organic Search, Direct 171
1 Organic Search, Direct, Not Found 3
2 Direct, Organic Search 389
3 Direct, Organic Search, Not Found 8
def foo(x):
x = x.split(', ')
x.sort()
return tuple(x)
df.groupby(df.A.apply(foo))['B'].sum()
out:
A
(Direct, Not Found, Organic Search) 11
(Direct, Organic Search) 560
CodePudding user response:
I can't take credit for this, but I'm not sure how to present it outside of a separate answer.
I am also new at Python. Using padu's logic code, I've adapted it to your original post's data.
I also substituted ['B']
with .reset_index()
to keep the original columns, as the sums were previously merged with the newly grouped values in a single column. I added also added numeric_only=True
because without specifying a column to sum, it would normally assume to stick with numeric columns, but the assumption is now a deprecated feature, as per this message I got:
The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function. df = df.groupby(df['MCF Channel Grouping Path_1'].apply(sort_groupings)).sum().reset_index()
You may also want to add df =
at the beginning of the last line, as below, to keep the result in the original variable name. Missing this step caused me to lose lots of time while learning to debug.
import pandas as pd
df = pd.DataFrame([
['Direct, Organic Search',171],
['Direct, Organic Search, Not Found',4],
['Direct, Organic Search, Not Found, Referral',12],
['Direct, Organic Search, Not Found, Referral, Email',2],
['Organic Search, Direct',389],
['Organic Search, Direct, Not Found',3],
['Organic Search, Direct, Referral',76],
['Organic Search, Direct, Referral, Social Network',1]],
columns =['MCF Channel Grouping Path_1', 'Conversions'])
def sort_groupings(row_group):
row_group = row_group.split(', ') # Converts string into list using ', ' as the separator
row_group.sort() # Sorts newly, made list in ascending order
return tuple(row_group) # Converts list to a tuple (required by the 'apply' function), and returns it
df = df.groupby(df['MCF Channel Grouping Path_1'].apply(sort_groupings)).sum(numeric_only=True).reset_index()
# groupby: column in which to group common values
# apply: call a function to run on values specified
# reset_index: reinstates the first column in the final df