Home > front end >  How to mark the duplicated items in dataframe?
How to mark the duplicated items in dataframe?

Time:11-10

I am trying to mark the index number for each duplicated item in the below dataframe.

  Column_A
0   Kitten
1   Kitten
2     Judy
3     Lamb
4     Momo
5     Judy

The new dataframe I want is as follows. As you can see, the items which have more than one (such as "Judy") are marked.

   Column_B Column_A
0   Kitten_1   Kitten
1   Kitten_2   Kitten
2     Judy_1   Judy
3     Lamb_1   Lamb
4    Momo _2   Momo
5     Judy_2    Judy

My code is as follows.

import pandas as pd
# My dataframe
data = {'Column_A':['Kitten','Kitten','Judy','Lamb','Momo','Judy']}
df = pd.DataFrame(data)

df = df.sort_values("Column_A", axis = 0, ascending = False)

df_list = df['Column_A'].tolist()
df_list.sort()

new_df_list = list(set(df_list))
new_list = []
for k in new_df_list:
    ori_list = df[df["Column_A"] == k ]['Column_A'].tolist()
    number = len(ori_list)
    for n in range(number):
        new_list.append(k '_' str(n 1))
        print(k '_' str(n 1))
print(len(new_list))
new_list.sort(reverse=True)
new_DF=pd.DataFrame(new_list, columns = ['Column_B'])
final_df = pd.concat([new_DF,df],axis=1)
print(final_df)

However, the output is as below, which is out of order.

   Column_B Column_A
0    Momo_1   Kitten
1    Lamb_1   Kitten
2  Kitten_2     Judy
3  Kitten_1     Lamb
4    Judy_2     Momo
5    Judy_1     Judy

Please help me fix this issue.

CodePudding user response:

First create your dataframe:

import pandas as pd
mylist =  ['Kitten', 'Kitten', 'Judy', 'Lamb', 'Momo', 'Judy']
df=pd.DataFrame()
df['Column_A'] = mylist

Then create a new my_index column filled with ones:

df['my_index'] = 1

You may then iterate over the unique names in Column_A and compute the cumulative sum of index my_index in each of the resulting dataframes df[df.Column_A==name]. The indices summed this way are saved to a new column named label.

list_of_series = []
for name in df.Column_A.unique():
    list_of_series.append(df[df.Column_A==name]['my_index'].cumsum())
output_series = pd.concat(list_of_series)
df['label']= output_series

Then it is just a matter of creating Column_B by summing the names in Column_A with the corresponding label strings:

df['Column_B'] = df['Column_A'].astype('str')  '_'  df['label'].astype(str)
df.drop(columns=['my_index', 'label'])

This returns the following output:

    Column_A    Column_B
0   Kitten  Kitten_1
1   Kitten  Kitten_2
2   Judy    Judy_1
3   Lamb    Lamb_1
4   Momo    Momo_1
5   Judy    Judy_2

CodePudding user response:

try this:

out = (df.Column_A.str.cat((df.groupby('Column_A')
                            .cumcount() 1).astype('str'), sep='_')
       .rename('Column_B')
       .to_frame()
       .join(df))
print(out)
>>>

    Column_B    Column_A
0   Kitten_1    Kitten
1   Kitten_2    Kitten
2   Judy_1      Judy
3   Lamb_1      Lamb
4   Momo_1      Momo
5   Judy_2      Judy
  • Related