I am trying to add a column to an existing dataframe by splitting the file name on the underscore character '_'.
Below is the screenshot of my dataframe.
Now I would like to add another column to the existing dataframe that holds the value after '_' and before .jpg, for example 1 in row 1 and 10 in row 2.
Below is the code I am using, but I am getting the error: invalid literal for int() with base 10: 'combined'
import numpy as np
import pandas as pd
master_df = pd.DataFrame()
master_df['filename'] = combined_faces_image_names
master_df['age'] = master_df['filename'].map(lambda img_name : np.uint8(img_name.split("_")[0]))
Could someone please help me?
CodePudding user response:
The error tells you that the word combined, which comes from splitting combined_faces.zip in row 33486 of your data, cannot be converted to int.
Try removing this row from the DataFrame before splitting and converting, or apply the conversion only to rows containing .jpg:
master_df['age'] = master_df['filename'].map(lambda img_name: np.uint8(img_name.split("_")[0]) if '.jpg' in img_name else None)
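The first option (dropping the non-image rows before converting) could be sketched as follows. This is only an illustration under assumptions: the sample file names are made up to stand in for combined_faces_image_names, and the new column is called value here. Because the question asks for the part after the underscore, the sketch takes the second piece of the split rather than the first.

```python
import pandas as pd

# Hypothetical sample data standing in for combined_faces_image_names
master_df = pd.DataFrame(
    {"filename": ["100_1.jpg", "100_10.jpg", "combined_faces.zip"]}
)

# Keep only the rows whose file name ends with ".jpg"
is_jpg = master_df["filename"].str.endswith(".jpg")
jpg_rows = master_df[is_jpg].copy()

# Split once from the right on "_", take the part after it ("1.jpg"),
# strip the 4-character ".jpg" suffix, then convert to int
jpg_rows["value"] = (
    jpg_rows["filename"]
    .str.rsplit("_", n=1)
    .str[1]
    .str[:-4]
    .astype(int)
)

print(jpg_rows)
```

Filtering first means the integer conversion never sees combined_faces.zip, so no conditional is needed inside the conversion itself.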
CodePudding user response:
You might use pandas.Series.str.extract for this task. Minified example:
import pandas as pd
df = pd.DataFrame({'name':['100_1.jpg','100_10.jpg','combined_faces.zip']})
df['value'] = df['name'].str.extract(r'(\d+)\.jpg$')
print(df)
output
                 name value
0           100_1.jpg     1
1          100_10.jpg    10
2  combined_faces.zip   NaN
Explanation: extract one or more (+) digits (\d) followed by .jpg at the end of the string ($). Note that in the above example the values (where not NaN) are strs. If you need numbers, you might use pandas.to_numeric as follows:
df['value'] = pd.to_numeric(df['name'].str.extract(r'(\d+)\.jpg$', expand=False))
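One thing to be aware of: because the non-matching row is NaN, pandas.to_numeric will typically give a float column here rather than an integer one. If whole numbers are wanted, one possible approach (a sketch, not part of the original answer) is to cast to the nullable "Int64" dtype. The column names value_float and value_int are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["100_1.jpg", "100_10.jpg", "combined_faces.zip"]})

# Capture the digits directly before ".jpg"; rows that do not match become NaN
digits = df["name"].str.extract(r"(\d+)\.jpg$", expand=False)

# Plain numeric conversion: NaN forces a floating-point result
df["value_float"] = pd.to_numeric(digits)

# Nullable integer dtype: keeps integers and shows the missing value as <NA>
df["value_int"] = pd.to_numeric(digits).astype("Int64")

print(df)
```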