Delete 30% of images from all subdirectories

I am building an image classifier. My dataset sits in a single folder, with no separate train and test sets. I want to walk all the subdirectories and, in each one, delete a randomly picked 30% of the images. (Note: the number of images differs from folder to folder, and the formats are mixed: some are .jpg and some are .png.) Alternatively, the 30% and the 70% could be separated into two different folders.

I have looked through many articles but have not been able to achieve this.

This is what the directory looks like.

CodePudding user response：

I have not tested it, but you can use this script called Thanos, which I just found, either directly or modified to suit your needs. It doesn't have a license, so I won't copy-paste it here.
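For the deletion variant the question asks about, a minimal standard-library sketch could look like the following. It is not the script mentioned above. The names `pick_files_to_delete`, `delete_fraction` and `IMAGE_SUFFIXES` are made up for illustration, and the choice to round the 30% down per folder is an assumption.

```python
import random
from pathlib import Path

# Assumption: only these extensions count as images; extend as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pick_files_to_delete(root, fraction=0.3, seed=None):
    """Pick a random `fraction` of the images in each subdirectory of `root`."""
    rng = random.Random(seed)
    picked = []
    # Visit every subdirectory (at any depth) in a stable order.
    for folder in sorted(p for p in Path(root).rglob("*") if p.is_dir()):
        images = sorted(
            f for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
        count = int(len(images) * fraction)  # rounds down
        picked.extend(rng.sample(images, count))
    return picked


def delete_fraction(root, fraction=0.3, seed=None, dry_run=True):
    """Delete the picked files; with dry_run=True only print them."""
    picked = pick_files_to_delete(root, fraction, seed)
    for f in picked:
        if dry_run:
            print(f"would delete {f}")
        else:
            f.unlink()
    return picked
```

Calling `delete_fraction("./images")` only lists the candidates. Pass `dry_run=False` once the list looks right, since the deletion cannot be undone.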

CodePudding user response：

Suppose this folder hierarchy:

images
├── color
│   ├── color1.jpg
│   ├── color2.png
│   ├── color3.png
│   ├── color4.png
│   └── color5.jpg
└── shape
    ├── shape1.jpg
    ├── shape2.png
    ├── shape3.png
    ├── shape4.png
    └── shape5.jpg

You can try something like this:

import pathlib
import pandas as pd
from sklearn.model_selection import train_test_split

# Collect (subdirectory name, file path) for every image under ./images
data = []
for f in pathlib.Path('./images').glob('**/*'):
    if f.suffix.lower() in ['.jpg', '.png']:  # also matches .JPG / .PNG
        data.append((f.parent.name, str(f)))
df = pd.DataFrame(data, columns=['feature', 'image'])

# Split each subdirectory's images 70/30 at random
data = {}
for feature, subdf in df.groupby('feature'):
    train, test = train_test_split(subdf['image'], train_size=.7)
    data[feature] = {'train': train.to_list(), 'test': test.to_list()}

Output (the split is random, so your result will differ):

>>> data
{'color': {'train': ['images/color/color1.jpg',
   'images/color/color2.png',
   'images/color/color5.jpg'],
  'test': ['images/color/color3.png', 'images/color/color4.png']},
 'shape': {'train': ['images/shape/shape4.png',
   'images/shape/shape1.jpg',
   'images/shape/shape3.png'],
  'test': ['images/shape/shape5.jpg', 'images/shape/shape2.png']}}
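The dictionary above only lists the files; nothing on disk has changed yet. To actually separate the two sets into different folders, as the question suggests, one possible sketch is shown below. The helper name `move_split` and the target layout `<dest_root>/<train|test>/<feature>/` are assumptions made for illustration, not part of the answer's code.

```python
import shutil
from pathlib import Path


def move_split(split, dest_root):
    """Move each listed file to dest_root/<subset>/<feature>/<filename>.

    `split` has the same shape as `data` above:
    {feature: {'train': [paths], 'test': [paths]}}
    """
    dest_root = Path(dest_root)
    moved = []
    for feature, subsets in split.items():
        for subset, files in subsets.items():
            target_dir = dest_root / subset / feature
            target_dir.mkdir(parents=True, exist_ok=True)
            for src in files:
                target = target_dir / Path(src).name
                shutil.move(str(src), str(target))
                moved.append(target)
    return moved
```

With the example hierarchy, `move_split(data, 'dataset')` would leave `dataset/train/color`, `dataset/test/color`, and so on. To delete the 30% instead of moving it, call `pathlib.Path(p).unlink()` on each path in the `'test'` lists.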