Home > Blockchain >  How to find repeated rows in pandas DataFrame for specific columns, and modify values by adding coun
How to find repeated rows in pandas DataFrame for specific columns, and modify values by adding coun

Time:12-24

Consider a dataframe with 2 columns for easiness. The first column is label which has same values for some of the observations in dataset.

Sample dataset:

import pandas as pd
  
data = [('A', 28),
        ('B', 32),
        ('B', 32),
        ('C', 25),
        ('D', 25),
        ('D', 40),
        ('E', 32) ]

data_df = pd.DataFrame(data, columns = ['label', 'num'])

For column label, I want to find rows with similar values. And convert its value into value_counter like below:

label   num
A        28
B_1      32 
B_2      32
C        25
D_1      25
D_2      40
E        32

I tried to use pandas groupby but I don't know which transform I have to use.

Thank you for your help.

CodePudding user response:

You can use:

s = data_df.groupby('label').cumcount() 1
data_df['label'] = np.where(data_df.duplicated(subset='label',  keep=False),
                             data_df['label']   '_'   s.astype(str), data_df['label'])

OUTPUT

  label  num
0     A   28
1   B_1   32
2   B_2   32
3     C   25
4   D_1   25
5   D_2   40
6     E   32

CodePudding user response:

You can create an empty dictionary, which you can append with your label and it's count (keys and values respectively). Then depending on whether the label is new, or it exists, you can increment it's value or return it intact.

The last step would be to use this new list as a new label column:

labels = data_df['label'].tolist()
new_labels = []
label_c = {}

# iterate through your labels list
for val in labels:
    if val not in label_c:     # if label not the new label list
        label_c[val] = 0       # add it to dictionary
        new_labels.append(val) # add it to the output as is
    else:                      # if it's not new
        label_c[val]  = 1      # increment its count
        new_labels.append(f"{val}_{label_c[val]}") # add it to the output along with its count

data_df['label'] = new_labels

prints back:

>>> print(data_df)

  label  num
0     A   28
1     B   32
2   B_1   32
3     C   25
4     D   25
5   D_1   40
6     E   32
  • Related