Home > database >  python pandas how to edit value of duplicate?
python pandas how to edit value of duplicate?

Time:05-03

I want to change value of my duplicate. Here is my dataframe:

  sku
FAT-001
FAT-001
FAT-001
FAT-002
FAT-002

my expected data frame will be look like this

  sku
FAT-001 #don't want to change first value of duplicate 
FAT-001-01
FAT-001-02
FAT-002
FAT-002-01

CodePudding user response:

df["number"] = df.groupby("sku").cumcount()
df.apply(lambda x: x.sku   ("" if x.number == 0 else "-"   str(x.number).rjust(2,"0")), axis = 1)

or:

df["number"] = "-"   df.groupby("sku").cumcount().astype(str).str.rjust(2, '0')
df.number[df["number"].eq("-"   "0" * 2)] = ""
df.sku   df.number 

my output:

0       FAT-001
1    FAT-001-01
2    FAT-001-02
3       FAT-002
4    FAT-002-01
dtype: object

explanation:

what is groupby? it is a sql-inspired command that give you some element base on every unique element... for example: length, max, list or other...

df = pd.DataFrame([
    [1,2],
    [1,3],
    [1,4],
    [2,5],
    [2,6],
], columns=["id","number"])
df.groupby("id").agg({"number": len})

give you:

    number
id  
1   3
2   2

the number of every unique elements, or

df.groupby("id").agg({"number": list})

give you

    number
id  
1   [2, 3, 4]
2   [5, 6]

and you can try max or min or first...

using agg, you can specify what you want for every column... if you have more than one (except what was groupby), you can specify something for each column...

except agg, there are other methods for goupped dataframes: like cumcount, that set index every row for each group, I mean it reset index for every group:

df.groupby("sku").cumcount()

output:

0    0
1    1
2    2
3    0
4    1

your first FAT-001 get index: 0, next: 1, ... and for FAT-002, first get index 0 again...

so, we have two part that you want, now... therefore we must find a way to join them for every row: axis:1 in apply means for every row

so, you have one exception: you don't want index for every row of each group... so change it to "", empty:

df.apply(lambda x: "" if x.number == 0 else str(x.number), axis = 1)

:

0     
1    1
2    2
3     
4    1
dtype: object

your 0 and 3 rows are for new-group...

next, your desired format is: 01,02,... a 0 for every index. pandas has a method that convert every string to string with desired length with an arbitrary char: rjust(desird_length, arbitrary_char)

how works: if you call that as rjust(2,"0"), it dos not change "22" or other 2-char, 3-char, ... strings, but, if your string length be 1 like 1 will be converted to 01 and ... (notice that there are a method called ljust too :))

df["number"] = df.groupby("sku").cumcount()
df.apply(lambda x: "" if x.number == 0 else str(x.number).rjust(2,"0"), axis = 1)
0      
1    01
2    02
3      
4    01
dtype: object

and if statement can written as:

if x.number == 0:
   return ""
else:
    return "-"   str(x.number).rjust(2,"0")

and some points:

  1. what is astype(str): it convert every element to string, and works like str(x), but for every element. why? to add "-" before it and using rjust.
  2. what is eq? it is is-equal? and return True for each row if be equal to value, and False otherwise.
  3. why df.number[df["number"].eq("-" "0" * 2)] = ""? because we convert all first elements of each group to ""
  4. why "-" "0" * 2? because we add "-" and ljust in previous line so we must use correct value: "-00". and why "0" * 2? because you can use every number for ljust length like 10 and set it there, too

CodePudding user response:

Similar to @MoRe's answer, use groupby.cumcount to create groups; then you could use str.zfill to pad 0s and mask the first elements of each group:

groups = df.groupby('sku').cumcount()
df['new'] = df['sku']   ('-'   groups.astype('string').str.zfill(2)).mask(groups.eq(0), '')

Output:

       sku         new
0  FAT-001     FAT-001
1  FAT-001  FAT-001-01
2  FAT-001  FAT-001-02
3  FAT-002     FAT-002
4  FAT-002  FAT-002-01

CodePudding user response:

For an alternative one-liner:

 df.sku = df.sku   df.groupby('sku').cumcount().apply(lambda x: f"-{x:02d}" if x > 0 else '')

Output:

          sku
0     FAT-001
1  FAT-001-01
2  FAT-001-02
3     FAT-002
4  FAT-002-01
  • Related