invalid literal for int() with base 10 while adding a dataframe column with split('_')

I am trying to add a column to an existing dataframe by splitting each file name on the underscore character '_'.

Below is the screenshot of my dataframe.

Now I would like to add another column to the existing dataframe that holds the value after '_' and before .jpg, for example 1 in the first row and 10 in the second.

Below is the code I am using, but I get the error: invalid literal for int() with base 10: 'combined'

master_df = pd.DataFrame()
master_df['filename'] = combined_faces_image_names
master_df['age'] = master_df['filename'].map(lambda img_name : np.uint8(img_name.split("_")[0]))

Could someone please help me?

CodePudding user response：

The error tells you that the word combined, which is what remains in row 33486 after splitting combined_faces.zip on '_' and taking the first part, cannot be converted to int.

Either remove this row from the DataFrame before splitting and converting, or apply the conversion only to rows whose file name contains .jpg:
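The failure can be reproduced outside pandas. The sketch below is only an illustration: try_parse is a hypothetical helper, the file name 100_1.jpg is an assumed example, and plain int() stands in for np.uint8 because the error message in the question comes from integer parsing.

```python
def try_parse(name):
    """Return the integer before the first '_' or the error message if parsing fails."""
    prefix = name.split("_")[0]  # "100_1.jpg" -> "100", "combined_faces.zip" -> "combined"
    try:
        return int(prefix)
    except ValueError as exc:
        return str(exc)

ok = try_parse("100_1.jpg")            # assumed example of a valid image name
bad = try_parse("combined_faces.zip")  # the stray archive name from the question

print(ok)
print(bad)
```

The second call returns the same kind of message as in the question, because "combined" is not a valid integer literal.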

master_df['age'] = master_df['filename'].map(lambda img_name: np.uint8(img_name.split("_")[0]) if '.jpg' in img_name else None)
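The other option, dropping the non-image rows first, might look like the sketch below. It assumes every image name has the form <age>_<number>.jpg; the sample file names and the column name number are made up for illustration and are not from the question.

```python
import pandas as pd

# Assumed sample data: two image names plus the stray archive name.
master_df = pd.DataFrame(
    {"filename": ["100_1.jpg", "100_10.jpg", "combined_faces.zip"]}
)

# Keep only rows whose file name ends with .jpg.
is_jpg = master_df["filename"].str.endswith(".jpg")
images = master_df[is_jpg].copy()

# Strip the extension, then split on '_' into separate columns.
stem = images["filename"].str[: -len(".jpg")]
parts = stem.str.split("_", expand=True)

images["age"] = parts[0].astype(int)     # part before the underscore
images["number"] = parts[1].astype(int)  # part after the underscore (hypothetical column name)

print(images)
```

Here parts[1] is the value after '_' and before .jpg that the question asks for, while parts[0] is what the original lambda was converting.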

CodePudding user response：

You might use pandas.Series.str.extract for this task. A minimal example:

import pandas as pd
df = pd.DataFrame({'name':['100_1.jpg','100_10.jpg','combined_faces.zip']})
df['value'] = df['name'].str.extract(r'(\d+)\.jpg$')

print(df)

output
                 name value
0           100_1.jpg     1
1          100_10.jpg    10
2  combined_faces.zip   NaN

Explanation: extract one or more (+) digits (\d) followed by .jpg and the end of the string ($). Note that in the above example the values (where not NaN) are strings. If you need numbers, you might use pandas.to_numeric as follows:

df['value'] = pd.to_numeric(df['name'].str.extract(r'(\d+)\.jpg$', expand=False))
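One caveat: because the unmatched row has a missing value, the numeric column typically will not be a plain integer column. If whole numbers are wanted, one possible follow-up (a sketch, not part of the answer above) is to cast to the nullable Int64 dtype; the variable names digits and as_int are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"name": ["100_1.jpg", "100_10.jpg", "combined_faces.zip"]})

# Capture the digits right before ".jpg"; rows without a match become missing.
digits = df["name"].str.extract(r"(\d+)\.jpg$", expand=False)

# Convert to numbers, then to the nullable integer dtype so the
# missing value is kept as <NA> instead of forcing floats.
as_int = pd.to_numeric(digits).astype("Int64")

df["value"] = as_int
print(df)
```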