I have a dataframe like the following:
df = pd.DataFrame({"Col1": ["AA", "AB", "AA", "CC", "FF"],
                   "Col2": [18, 23, 13, 33, 48],
                   "Col3": [17, 27, 22, 37, 52]})
My goal: if there are duplicated values in Col1, I want to sort only those duplicates by their values in Col2, from smallest to largest, and rename the original "Value" in Col1 to "Value.A" (for the duplicate with the smallest value in Col2), "Value.B" (for the second smallest), and so on. The values in Col3 should stay unchanged.
Using the example above, this is what I should end up with:
pd.DataFrame({"Col1": ["AA.B", "AB", "AA.A", "CC", "FF"],
              "Col2": [18, 23, 13, 33, 48],
              "Col3": [17, 27, 22, 37, 52]})
Since 13 < 18, the second "AA" becomes "AA.A" and the first "AA" becomes "AA.B" (the values in Col3 stay unchanged). Also, "AB", "CC", and "FF" all need to remain unchanged. I could potentially have more than one set of duplicates in Col1.
I do not need to preserve the row order, as long as the values in each row stay the same apart from the renamed value in Col1 (i.e., I should still have "AA.B", 18, 17 in the three columns, no matter where the first row ends up in the output).
I tried to use
row['Col1'] == df['Col1'].shift()
as a lambda function, but this gives me the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I suspect this is due to the NaN value introduced when I call shift(), but using fillna() doesn't help, since that will always create a duplicate at the beginning.
Any suggestions on how I can make it work?
CodePudding user response:
You can use pandas' GroupBy.apply with a custom function to get what you want.
You group the dataframe by the first column and then apply your custom function to each "sub" dataframe. In this case, I check whether there is a duplicate and, if so, use some sorting and ASCII value arithmetic to generate the new labels that you need.
import numpy as np
import pandas as pd

# your example
df = pd.DataFrame({"Col1": ["AA", "AB", "AA", "CC", "FF"],
                   "Col2": [18, 23, 13, 33, 48],
                   "Col3": [17, 27, 22, 37, 52]})

def de_dup_labels(x):
    """A function to de-duplicate labels based on the order of Col2"""
    # if at least 1 duplicate exists
    if len(x) > 1:
        # work out the rank each row would have if Col2 were sorted
        # (the argsort of the argsort gives ranks, which also holds for 3+ duplicates)
        rank = np.argsort(np.argsort(x["Col2"].to_numpy()))
        # update Col1 with the new labels using the ranks from above
        x = x.assign(Col1=[f"{label}.{chr(ord('A') + r)}"
                           for label, r in zip(x["Col1"], rank)])
    return x

# group_keys=False keeps the original index instead of adding a "Col1" level
updated_df = df.groupby("Col1", group_keys=False).apply(de_dup_labels)
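As a hedged alternative to the groupby/apply approach above, the same relabelling can be sketched without apply at all, which sidesteps how apply treats the grouping column in recent pandas versions. This is only a sketch under stated assumptions: the names is_dup, rank, suffix and result are illustrative, and it assumes no Col1 value occurs more than 26 times, so a single letter is enough.

```python
import pandas as pd

df = pd.DataFrame({"Col1": ["AA", "AB", "AA", "CC", "FF"],
                   "Col2": [18, 23, 13, 33, 48],
                   "Col3": [17, 27, 22, 37, 52]})

# Rows whose Col1 value occurs more than once (every occurrence is flagged).
is_dup = df["Col1"].duplicated(keep=False)

# Rank of each row's Col2 within its Col1 group: 0 for the smallest, 1 for the next, ...
# method="first" breaks ties by row order, so ranks within a group are distinct.
rank = df.groupby("Col1")["Col2"].rank(method="first").astype(int) - 1

# Map ranks 0, 1, 2, ... to "A", "B", "C", ...
# Assumption: at most 26 duplicates per Col1 value.
suffix = rank.map(lambda r: chr(ord("A") + r))

# Keep Col1 where the value is unique; otherwise append "." and the letter.
result = df.copy()
result["Col1"] = df["Col1"].where(~is_dup, df["Col1"] + "." + suffix)
```

Here result keeps the original row order and leaves Col2 and Col3 untouched; only the duplicated Col1 values receive a suffix.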
CodePudding user response:
The logic is:
- Group by "Col1".
- Collect Col2 as a map {index: Col2}, sorted by Col2.
- Replace "Col1" with "Col1.1", "Col1.2", ... if it is a duplicate.
- Join the new "Col1" using the original index.
PS - I have changed the suffix logic to use ".1", ".2", ... since it is not clear which series to use once ".A", ".B", ... are exhausted.
import pandas as pd

df = pd.DataFrame({"Col1": ["AA", "AB", "AA", "CC", "FF"],
                   "Col2": [18, 23, 13, 33, 48],
                   "Col3": [17, 27, 22, 37, 52]})
df_grp = (
    df.groupby("Col1")
    # Collect Col2 as (index, Col2) pairs, sorted by Col2.
    .agg(Col2=("Col2", lambda s: sorted(zip(s.index, s), key=lambda y: y[1])))
    .reset_index()
)
# Mark the duplicate values.
df_grp["is_dup"] = df_grp["Col2"].apply(lambda x: len(x) > 1)
# Replace "Col1" with "Col1.1", "Col1.2", ... if it is a duplicate.
df_grp = df_grp.explode("Col2").reset_index().rename(columns={"index": "Col2_index"})
df_grp["Col2_index"] = df_grp.groupby("Col2_index").cumcount()
df_grp["Col1"] = df_grp.apply(
    lambda x: f'{x["Col1"]}.{x["Col2_index"] + 1}' if x["is_dup"] else x["Col1"],
    axis=1)
# Restore the original index.
df_grp["orig_index"] = df_grp["Col2"].apply(lambda x: x[0])
df_grp = df_grp.set_index("orig_index")
# Join the new "Col1" using the original index.
df = df.drop("Col1", axis=1).join(df_grp[["Col1"]])
Output:
   Col2  Col3  Col1
0    18    17  AA.2
1    23    27    AB
2    13    22  AA.1
3    33    37    CC
4    48    52    FF
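If letter suffixes are still preferred, one possible way around the concern about running out of ".A", ".B", ... is spreadsheet-style column naming (A to Z, then AA, AB, and so on). The sketch below is only an illustration: letter_suffix is a hypothetical helper, not a pandas function, and the continuation scheme after "Z" is an assumption, since the question does not specify one.

```python
import pandas as pd

def letter_suffix(n):
    """Hypothetical helper: 0 -> 'A', 25 -> 'Z', 26 -> 'AA', 27 -> 'AB', ..."""
    n = int(n) + 1  # switch to 1-based (bijective base-26) numbering
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters

df = pd.DataFrame({"Col1": ["AA", "AB", "AA", "CC", "FF"],
                   "Col2": [18, 23, 13, 33, 48],
                   "Col3": [17, 27, 22, 37, 52]})

# Sort by Col2 so that cumcount numbers each group's rows from smallest to largest.
ordered = df.sort_values("Col2", kind="stable")
position = ordered.groupby("Col1").cumcount()
is_dup = ordered["Col1"].duplicated(keep=False)

# Only duplicated values get a suffix; the assignment aligns on the original index.
df["Col1"] = ordered["Col1"].where(
    ~is_dup, ordered["Col1"] + "." + position.map(letter_suffix))
```

Because the assignment aligns on the index, df keeps its original row order even though the ranking is computed on the sorted copy.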