Home > Mobile >  Modify duplicated rows in dataframe (Python)
Modify duplicated rows in dataframe (Python)

Time:12-29

I am working with a dataframe in Pandas and I need a solution to automatically modify one of the columns that has duplicate values. It is a column type 'object' and I would need to modify the name of the duplicate values. The dataframe is the following:

      City           Year       Restaurants
0   New York         2001       20
1      Paris         2000       40
2   New York         1999       41
3   Los Angeles      2004       35
4     Madrid         2001       22
5   New York         1998       33
6   Barcelona        2001       15

As you can see, New York is repeated 3 times. I would like to create a new dataframe in which this value would be automatically modified and the result would be the following:

      City           Year       Restaurants
0   New York 2001    2001       20
1      Paris         2000       40
2   New York 1999    1999       41
3   Los Angeles      2004       35
4     Madrid         2001       22
5   New York 1998    1998       33
6   Barcelona        2001       15

I would also be happy with "New York 1", "New York 2" and "New York 3". Any option would be good.

CodePudding user response:

Use np.where, to modify column City if duplicated

df['City']=np.where(df['City'].duplicated(keep=False), df['City'] ' ' df['Year'].astype(str),df['City'])

CodePudding user response:

A different approach without the use of numpy would be with groupby.cumcount() which will give you your alternative New York 1, New York 2 but for all values.

df['City'] = df['City']   ' '   df.groupby('City').cumcount().add(1).astype(str)

            City  Year  Restaurants
0     New York 1  2001           20
1        Paris 1  2000           40
2     New York 2  1999           41
3  Los Angeles 1  2004           35
4       Madrid 1  2001           22
5     New York 3  1998           33
6    Barcelona 1  2001           15

To have an increment only in the duplicate cases you can use loc:

df.loc[df[df.City.duplicated(keep=False)].index, 'City'] = df['City']   ' '   df.groupby('City').cumcount().add(1).astype(str)

          City  Year  Restaurants
0   New York 1  2001           20
1        Paris  2000           40
2   New York 2  1999           41
3  Los Angeles  2004           35
4       Madrid  2001           22
5   New York 3  1998           33
6    Barcelona  2001           15
  • Related