I am trying to mark the index number for each duplicated item in the below dataframe.
Column_A
0 Kitten
1 Kitten
2 Judy
3 Lamb
4 Momo
5 Judy
The new dataframe I want is as follows. As you can see, the items which have more than one (such as "Judy") are marked.
Column_B Column_A
0 Kitten_1 Kitten
1 Kitten_2 Kitten
2 Judy_1 Judy
3 Lamb_1 Lamb
4 Momo _2 Momo
5 Judy_2 Judy
My code is as follows.
import pandas as pd
# My dataframe
data = {'Column_A':['Kitten','Kitten','Judy','Lamb','Momo','Judy']}
df = pd.DataFrame(data)
df = df.sort_values("Column_A", axis = 0, ascending = False)
df_list = df['Column_A'].tolist()
df_list.sort()
new_df_list = list(set(df_list))
new_list = []
for k in new_df_list:
ori_list = df[df["Column_A"] == k ]['Column_A'].tolist()
number = len(ori_list)
for n in range(number):
new_list.append(k '_' str(n 1))
print(k '_' str(n 1))
print(len(new_list))
new_list.sort(reverse=True)
new_DF=pd.DataFrame(new_list, columns = ['Column_B'])
final_df = pd.concat([new_DF,df],axis=1)
print(final_df)
However, the output is as below, which is out of order.
Column_B Column_A
0 Momo_1 Kitten
1 Lamb_1 Kitten
2 Kitten_2 Judy
3 Kitten_1 Lamb
4 Judy_2 Momo
5 Judy_1 Judy
Please help me fix this issue.
CodePudding user response:
First create your dataframe:
import pandas as pd
mylist = ['Kitten', 'Kitten', 'Judy', 'Lamb', 'Momo', 'Judy']
df=pd.DataFrame()
df['Column_A'] = mylist
Then create a new my_index
column filled with ones:
df['my_index'] = 1
You may then iterate over the unique names in Column_A
and compute the cumulative sum of index my_index
in each of the resulting dataframes df[df.Column_A==name]
. The indices summed this way are saved to a new column named label
.
list_of_series = []
for name in df.Column_A.unique():
list_of_series.append(df[df.Column_A==name]['my_index'].cumsum())
output_series = pd.concat(list_of_series)
df['label']= output_series
Then it is just a matter of creating Column_B
by summing the names in Column_A
with the corresponding label
strings:
df['Column_B'] = df['Column_A'].astype('str') '_' df['label'].astype(str)
df.drop(columns=['my_index', 'label'])
This returns the following output:
Column_A Column_B
0 Kitten Kitten_1
1 Kitten Kitten_2
2 Judy Judy_1
3 Lamb Lamb_1
4 Momo Momo_1
5 Judy Judy_2
CodePudding user response:
try this:
out = (df.Column_A.str.cat((df.groupby('Column_A')
.cumcount() 1).astype('str'), sep='_')
.rename('Column_B')
.to_frame()
.join(df))
print(out)
>>>
Column_B Column_A
0 Kitten_1 Kitten
1 Kitten_2 Kitten
2 Judy_1 Judy
3 Lamb_1 Lamb
4 Momo_1 Momo
5 Judy_2 Judy