I am building an image classifier. My dataset is in a single folder, with no separate train and test sets. I want to go through all the subdirectories and, in each one, delete a randomly picked 30% of the images, or alternatively separate the images 70/30 into two different folders. (Note: the number of images differs from folder to folder, and the formats are mixed: some are .jpg and some are .png.)
I have looked into many articles but have not been able to achieve this.
This is what the directory looks like.
CodePudding user response:
I have not tested it, but I just found a script called Thanos that you can either use directly or modify to suit your preference. It doesn't have a license, so I guess I won't copy-paste it here.
CodePudding user response:
Suppose this folder hierarchy:
images
├── color
│   ├── color1.jpg
│   ├── color2.png
│   ├── color3.png
│   ├── color4.png
│   └── color5.jpg
└── shape
    ├── shape1.jpg
    ├── shape2.png
    ├── shape3.png
    ├── shape4.png
    └── shape5.jpg
You can try something like this:
import pathlib

import pandas as pd
from sklearn.model_selection import train_test_split

# Collect (subdirectory name, file path) for every image under ./images
data = []
for f in pathlib.Path('./images').glob('**/*'):
    # lower() so that extensions such as .JPG are matched too
    if f.suffix.lower() in ['.jpg', '.png']:
        data.append((f.parent.name, str(f)))

df = pd.DataFrame(data, columns=['feature', 'image'])

# Split the images of each subdirectory 70/30, independently of the others
data = {}
for feature, subdf in df.groupby('feature'):
    train, test = train_test_split(subdf['image'], train_size=.7)
    data[feature] = {'train': train.to_list(), 'test': test.to_list()}
Output:
>>> data
{'color': {'train': ['images/color/color1.jpg',
                     'images/color/color2.png',
                     'images/color/color5.jpg'],
           'test': ['images/color/color3.png', 'images/color/color4.png']},
 'shape': {'train': ['images/shape/shape4.png',
                     'images/shape/shape1.jpg',
                     'images/shape/shape3.png'],
           'test': ['images/shape/shape5.jpg', 'images/shape/shape2.png']}}
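The dictionary above only lists the paths; it does not move anything on disk. If you want the two separate folders mentioned in the question, here is a minimal sketch using only the standard library. The function name `split_into_folders`, the `train/<class>` and `test/<class>` output layout, the accepted extensions, and the fixed seed are all assumptions for illustration, not part of the answer above. Files are copied rather than moved, so the original folder stays untouched.

```python
import random
import shutil
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def split_into_folders(src, dst, train_size=0.7, seed=0):
    """Copy the images of each subfolder of src into
    dst/train/<subfolder> and dst/test/<subfolder>.

    Returns the number of files copied per subfolder and part.
    """
    src, dst = Path(src), Path(dst)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    counts = {}
    for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        # Sort first so the shuffle does not depend on filesystem order
        images = sorted(
            f for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
        rng.shuffle(images)
        # Round down, so 5 images give 3 for train and 2 for test
        n_train = int(len(images) * train_size)
        parts = {"train": images[:n_train], "test": images[n_train:]}
        for part, files in parts.items():
            target = dst / part / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy2(f, target / f.name)
        counts[class_dir.name] = {p: len(fs) for p, fs in parts.items()}
    return counts
```

For the hierarchy above, `split_into_folders('./images', './dataset')` would create `dataset/train/color`, `dataset/test/color`, and so on, with 3 and 2 files each. If you would rather delete 30% in place, as in the first option of the question, you could call `f.unlink()` on the files in the `test` part instead of copying them, but try it on a backup first.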