I have around 65,000 JPG images of cars, and each filename encodes information about the car. For example:
Acura_ILX_2013_28_16_110_15_4_70_55_179_39_FWD_5_4_4dr_aWg.jpg
The filename fields are make, model, year, MSRP, front wheel size and SAE net horsepower, followed by:
'Displacement', 'Engine Type', 'Width, Max w/o mirrors (in)', 'Height, Overall (in)',
'Length, Overall (in)', 'Gas Mileage', 'Drivetrain', 'Passenger Capacity', 'Passenger Doors',
'Body Style', 'unique identifier'
Because there are different images of the same car, a unique three-letter identifier is appended to the end of each filename.
I have created a data frame from the file names using the following code:
import os
import pandas as pd
car_file = os.listdir(r"dir")
make = []
model = []
year = []
msrp = []
front_wheel_size = []
sae_net_hp = []
displacement = []
engine_type = []
width = []
height = []
length = []
mpg = []
drivetrain = []
passenger_capacity = []
doors = []
body_style = []
for i in car_file:
    make.append(i.split("_")[0])
    model.append(i.split("_")[1])
    year.append(i.split("_")[2])
    msrp.append(i.split("_")[3])
    front_wheel_size.append(i.split("_")[4])
    sae_net_hp.append(i.split("_")[5])
    displacement.append(i.split("_")[6])
    engine_type.append(i.split("_")[7])
    width.append(i.split("_")[8])
    height.append(i.split("_")[9])
    length.append(i.split("_")[10])
    mpg.append(i.split("_")[11])
    drivetrain.append(i.split("_")[12])
    passenger_capacity.append(i.split("_")[13])
    doors.append(i.split("_")[14])
    body_style.append(i.split("_")[15])
df = pd.DataFrame([make,model,year,msrp,front_wheel_size,sae_net_hp,displacement,engine_type,width,height,length,mpg,drivetrain,passenger_capacity,doors,body_style]).T
(I presume this is not the cleanest way to do it.)
My question is: how can I most efficiently include the JPG images in the dataset, perhaps as an additional column at the end?
CodePudding user response:
I am not really sure you actually WANT to open all 65,000 images at once, as this may occupy a huge amount of memory. I'd recommend simply saving the path to each image in the DataFrame.
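A minimal sketch of that path-only approach, assuming the images sit in a single folder. The helper names `index_images` and `load_image` are made up for illustration and are not part of any library. The DataFrame holds only the file paths, and an image is read from disk when a specific row is requested.

```python
from pathlib import Path

import pandas as pd


def index_images(folder: str) -> pd.DataFrame:
    """Return one row per .jpg file, storing only its path (no pixel data)."""
    paths = sorted(Path(folder).glob("*.jpg"))
    return pd.DataFrame({"File": paths})


def load_image(df: pd.DataFrame, idx: int):
    """Read the image for a single row from disk, on demand."""
    # Imported here so that building the index does not require matplotlib.
    import matplotlib.image as mpimg

    return mpimg.imread(df.at[idx, "File"])
```

With this layout, something like `img = load_image(df, 0)` reads one image at a time, so memory use stays proportional to the images you are currently working with rather than to the whole collection.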
If you really want to open it, see: How to read images into a script?
But to clean up your original code: I did something similar a while back and I solved it via regex. That might be overdoing it though. But you can use split directly to put your values into rows instead of building columns. Both ideas in the example below (might contain errors).
from pathlib import Path
import re
import pandas as pd
import matplotlib.image as mpimg
from typing import Iterable, List
FILEPARTS = [
    "make", "model", "year", "msrp", "front_wheel_size",
    "sae_net_hp", "displacement", "engine_type",
    "width", "height", "length", "mpg",
    "drivetrain", "passenger_capacity",
    "doors", "body_style", "id"
]
def via_regex(path_to_folder: str) -> pd.DataFrame:
    """ Matches filenames via regex.
    This way you would skip all files in the folder that are not
    .jpg and also don't match your pattern."""
    folder = Path(path_to_folder)
    # select only .jpg files
    files = folder.glob('*.jpg')
    matches = filename_matcher(files)
    # build DataFrame
    df = pd.DataFrame(m.groupdict() for m in matches)
    df["File"] = [folder / m.string for m in matches]
    df["Image"] = [mpimg.imread(f) for f in df["File"].to_numpy()]
    return df
def filename_matcher(files: Iterable) -> List:
    """Match the desired pattern to the filename, i.e. extract the data from
    the filename into a match object. More flexible, and via regex you
    could also separate numbers from units or similar."""
    # create regex pattern that groups the parts between underscores;
    # the trailing \.jpg keeps the file extension out of the last group
    pattern = "_".join(f"(?P<{name}>[^_]+)" for name in FILEPARTS) + r"\.jpg$"
    pattern = re.compile(pattern)
    # match the pattern
    matches = (pattern.match(f.name) for f in files)
    return [match for match in matches if match is not None]
def via_split(path_to_folder: str) -> pd.DataFrame:
    """ Assumes all .jpg files have the right naming."""
    folder = Path(path_to_folder)
    rows = []
    # select only .jpg files and build one row (dict) per file
    for f in folder.glob('*.jpg'):
        row = dict(zip(FILEPARTS, f.stem.split('_')))
        row["File"] = f
        row["Image"] = mpimg.imread(f)
        rows.append(row)
    return pd.DataFrame(rows, columns=FILEPARTS + ["File", "Image"])
if __name__ == '__main__':
    df_re = via_regex('dir')
    df_split = via_split('dir')
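To check the parsing without touching any image files, here is a small sketch that applies the same kind of pattern to the example filename from the question. The list `numeric_cols` is an assumption about which fields are always numeric; `errors="coerce"` turns anything that is not a number into NaN instead of raising.

```python
import re

import pandas as pd

FILEPARTS = [
    "make", "model", "year", "msrp", "front_wheel_size",
    "sae_net_hp", "displacement", "engine_type",
    "width", "height", "length", "mpg",
    "drivetrain", "passenger_capacity",
    "doors", "body_style", "id",
]

# Example filename taken from the question.
name = "Acura_ILX_2013_28_16_110_15_4_70_55_179_39_FWD_5_4_4dr_aWg.jpg"

# One named group per underscore-separated part, followed by the extension.
pattern = re.compile(
    "_".join(f"(?P<{part}>[^_]+)" for part in FILEPARTS) + r"\.jpg$"
)
match = pattern.match(name)
row = match.groupdict()

df = pd.DataFrame([row])

# Assumption: these fields hold plain numbers in every filename.
numeric_cols = ["year", "width", "height", "length", "mpg"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(df.loc[0, ["make", "model", "year", "id"]])
```

All values come out of the filename as strings, so converting the numeric fields up front makes later filtering and sorting behave as expected.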