I need to drop duplicates based on the length of the column "Employee History"; the row with the longest value should be kept. Note: there are many, many more columns, but these are the two columns that matter for this case.
 | Company ID | Employee History
---|---|---
253 | 462106-27 | 2021: 21, 2022: 26
264 | 181831-33 | 2019: 20, 2020: 60, 2021: 172, 2022: 225
338 | 181831-33 | 2019: 20, 2020: 60, 2021: 172
3481 | 462106-27 | 2021: 21
CodePudding user response:
The solution below drops duplicates based on the maximum length of the Employee History column:
# Assign a helper column holding the string length of Employee History.
df = df.assign(count=df["Employee History"].str.len())
# For each Company ID, keep the row with the maximum length, then drop the helper column.
df = df.loc[df.groupby('Company ID')['count'].idxmax()].drop('count', axis=1)
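For reference, a minimal, self-contained run of this approach on the sample data from the question (assuming the unnamed first column is the index) could look like this:
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Company ID": ["462106-27", "181831-33", "181831-33", "462106-27"],
    "Employee History": ["2021: 21, 2022: 26", "2019: 20, 2020: 60, 2021: 172, 2022: 225", "2019: 20, 2020: 60, 2021: 172", "2021: 21"]},
    index=[253, 264, 338, 3481]
)

# Helper column with the string length, then keep the row with the max length per Company ID.
df = df.assign(count=df["Employee History"].str.len())
df = df.loc[df.groupby("Company ID")["count"].idxmax()].drop("count", axis=1)
print(df)
Only rows 264 and 253 remain, i.e. the longest Employee History per Company ID.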
CodePudding user response:
You can just sort the whole data frame by the length of Employee History in descending order and then drop duplicates on the subset Company ID. The drop_duplicates method keeps the first occurrence, which is always the row with the longest Employee History.
Code:
import pandas as pd

df = pd.DataFrame({
    "Company ID": ["462106-27", "181831-33", "181831-33", "462106-27"],
    "Employee History": ["2021: 21, 2022: 26", "2019: 20, 2020: 60, 2021: 172, 2022: 225", "2019: 20, 2020: 60, 2021: 172", "2021: 21"]},
    index=[253, 264, 338, 3481]
)

(df.assign(temp=df["Employee History"].str.len())
   .sort_values(by=["Company ID", "temp"], ascending=False)
   .drop_duplicates(subset=["Company ID"])
   .drop(columns="temp"))
Output:
    Company ID                          Employee History
253  462106-27                        2021: 21, 2022: 26
264  181831-33  2019: 20, 2020: 60, 2021: 172, 2022: 225
CodePudding user response:
First, sort the data set by the length of "Employee History". Then insert every row into an OrderedDict, using "Company ID" as the key and the other columns as the value. Finally, convert the dict back into a table.
Note: from Python 3.7 on, regular dicts are guaranteed to be insertion-ordered, too.
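No code is given in this answer; a minimal sketch of the idea, assuming the same sample data as above and sorting ascending by length so that the longest history per Company ID is inserted last and wins, could look like this:
from collections import OrderedDict
import pandas as pd

# Same sample data as in the previous answer.
df = pd.DataFrame({
    "Company ID": ["462106-27", "181831-33", "181831-33", "462106-27"],
    "Employee History": ["2021: 21, 2022: 26", "2019: 20, 2020: 60, 2021: 172, 2022: 225", "2019: 20, 2020: 60, 2021: 172", "2021: 21"]},
    index=[253, 264, 338, 3481]
)

# Sort ascending by string length so that, for duplicate Company IDs, the longest row comes last.
df = df.sort_values(by="Employee History", key=lambda s: s.str.len())

# Insert every row keyed by Company ID; later (longer) rows overwrite earlier ones.
# An OrderedDict as described; a plain dict behaves the same on Python 3.7+.
rows = OrderedDict()
for idx, row in df.iterrows():
    rows[row["Company ID"]] = (idx, row["Employee History"])

# Restore the dict to a table.
result = pd.DataFrame(
    {"Company ID": list(rows.keys()),
     "Employee History": [hist for _, hist in rows.values()]},
    index=[idx for idx, _ in rows.values()],
)
print(result)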