Home > Back-end >  Pandas: Need to increment duplicate file names starting at 1
Pandas: Need to increment duplicate file names starting at 1

Time:05-11

I have a column containing file names - numerous duplicates - that need to be incremented starting at 001, 002 ...etc. Ex. filename_001.pdf, filename_002.pdf

df_files = pd.DataFrame([[1000, 'filename.pdf'], 
                         [1001, 'filename.pdf'], 
                         [1002, 'a_file.txt'],
                         [1003, 'a_file.txt'],
                         [1004, 'a_file.txt']],
                       columns=['ID', 'filename'])

All the methods I'm finding start at 2.

First extract the extension and filename minus extension:

df_files['ext'] = [os.path.splitext(f)[-1] for f in df_files['filename']]
df_files['Filestub'] = [os.path.splitext(f)[0] for f in df_files['filename']]

The following method will increment successfully, but does not start at 1 nor does it use a convention that would allow for triple digit duplicates ( Eg., 00X ).

df_files['NumberedCopy'] = df_files['filename'].where(~df_files['filename'].duplicated(), 
                                                           df_files['Filestub']   "_"\
                                                             df_files.groupby('Filestub').cumcount().add(1).astype(str)   df_files['ext'])

Output [ WRONG ]:

    ID          filename        Filestub    ext     NumberedCopy
0   1000        filename.pdf    filename    .pdf    filename.pdf
1   1001        filename.pdf    filename    .pdf    filename_2.pdf
2   1002        a_file.txt      a_file      .txt    a_file.txt
3   1003        a_file.txt      a_file      .txt    a_file_2.txt
4   1004        a_file.txt      a_file      .txt    a_file_3.txt

Desired output:

    ID      filename        Filestub    ext     NumberedCopy
0   1000    filename.pdf    filename    .pdf    filename_001.pdf
1   1001    filename.pdf    filename    .pdf    filename_002.pdf
2   1002    a_file.txt      a_file      .txt    a_file_001.txt
3   1003    a_file.txt      a_file      .txt    a_file_002.txt
4   1004    a_file.txt      a_file      .txt    a_file_003.txt

CodePudding user response:

Try:

numbered = df_files["Filestub"]   "_"   df_files.groupby("Filestub").cumcount().add(1).astype(str).str.zfill(3)   df_files["ext"]

df["NumberedCopy"] = numbered.where(df_files["Filestub"].duplicated(keep=False), df_files["filename"])

>>> df_files
     ID      filename   ext  Filestub      NumberedCopy
0  1000  filename.pdf  .pdf  filename  filename_001.pdf
1  1001  filename.pdf  .pdf  filename  filename_002.pdf
2  1002    a_file.txt  .txt    a_file    a_file_001.txt
3  1003    a_file.txt  .txt    a_file    a_file_002.txt
4  1004    a_file.txt  .txt    a_file    a_file_003.txt
  • Related