I have a column containing file names - numerous duplicates - that need to be incremented starting at 001, 002 ...etc. Ex. filename_001.pdf, filename_002.pdf
df_files = pd.DataFrame([[1000, 'filename.pdf'],
[1001, 'filename.pdf'],
[1002, 'a_file.txt'],
[1003, 'a_file.txt'],
[1004, 'a_file.txt']],
columns=['ID', 'filename'])
All the methods I'm finding start at 2.
First extract the extension and filename minus extension:
df_files['ext'] = [os.path.splitext(f)[-1] for f in df_files['filename']]
df_files['Filestub'] = [os.path.splitext(f)[0] for f in df_files['filename']]
The following method will increment successfully, but does not start at 1 nor does it use a convention that would allow for triple digit duplicates ( Eg., 00X ).
df_files['NumberedCopy'] = df_files['filename'].where(~df_files['filename'].duplicated(),
df_files['Filestub'] "_"\
df_files.groupby('Filestub').cumcount().add(1).astype(str) df_files['ext'])
Output [ WRONG ]:
ID filename Filestub ext NumberedCopy
0 1000 filename.pdf filename .pdf filename.pdf
1 1001 filename.pdf filename .pdf filename_2.pdf
2 1002 a_file.txt a_file .txt a_file.txt
3 1003 a_file.txt a_file .txt a_file_2.txt
4 1004 a_file.txt a_file .txt a_file_3.txt
Desired output:
ID filename Filestub ext NumberedCopy
0 1000 filename.pdf filename .pdf filename_001.pdf
1 1001 filename.pdf filename .pdf filename_002.pdf
2 1002 a_file.txt a_file .txt a_file_001.txt
3 1003 a_file.txt a_file .txt a_file_002.txt
4 1004 a_file.txt a_file .txt a_file_003.txt
CodePudding user response:
Try:
numbered = df_files["Filestub"] "_" df_files.groupby("Filestub").cumcount().add(1).astype(str).str.zfill(3) df_files["ext"]
df["NumberedCopy"] = numbered.where(df_files["Filestub"].duplicated(keep=False), df_files["filename"])
>>> df_files
ID filename ext Filestub NumberedCopy
0 1000 filename.pdf .pdf filename filename_001.pdf
1 1001 filename.pdf .pdf filename filename_002.pdf
2 1002 a_file.txt .txt a_file a_file_001.txt
3 1003 a_file.txt .txt a_file a_file_002.txt
4 1004 a_file.txt .txt a_file a_file_003.txt