I am trying to split a dataset with sklearn's train_test_split, but I keep getting the error shown below. Full details of what I am trying to do are here: https://www.kaggle.com/code/vencerlanz09/pharmaceutical-drugs-classification-using-yolov5#✂️Splitting-the-Dataset
# Read images and annotations
image_dir = r"C:/Users/X3/pharmaceutical-drugs-and-vitamins-synthetic-images/ImageClassesCombinedWithCOCOAnnotations/images_raw"
images = [os.path.join(image_dir, x) for x in os.listdir(image_dir)]
annotations = [os.path.join('C:/Users/X3/1/text_files', x) for x in os.listdir('C:/Users/X3/1/text_files') if x[-3] == "txt"]
images.sort()
annotations.sort()
# Split the dataset into train-valid-test splits
train_images, val_images, train_annotations, val_annotations = train_test_split(images, annotations, test_size = 0.2, random_state = 1)
val_images, test_images, val_annotations, test_annotations = train_test_split(val_images, val_annotations, test_size = 0.5, random_state = 1)
**The error I am getting is:**
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_6292\1715792042.py in <module>
8
9 # Split the dataset into train-valid-test splits
---> 10 train_images, val_images, train_annotations, val_annotations = train_test_split(images, annotations, test_size = 1.0, random_state = 1)
11 val_images, test_images, val_annotations, test_annotations = train_test_split(val_images, val_annotations, test_size = 0.5, random_state = 1)
~\anaconda3\lib\site-packages\sklearn\model_selection\_split.py in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
2415 raise ValueError("At least one array required as input")
2416
-> 2417 arrays = indexable(*arrays)
2418
2419 n_samples = _num_samples(arrays[0])
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in indexable(*iterables)
376
377 result = [_make_indexable(X) for X in iterables]
--> 378 check_consistent_length(*result)
379 return result
380
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
330 uniques = np.unique(lengths)
331 if len(uniques) > 1:
--> 332 raise ValueError(
333 "Found input variables with inconsistent numbers of samples: %r"
334 % [int(l) for l in lengths]
ValueError: Found input variables with inconsistent numbers of samples: [10001, 0]
CodePudding user response:
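This error is saying you have 10001 images and 0 annotations, so the annotations list is empty.
The likely cause is the filter x[-3] == "txt": x[-3] is a single character, so it can never equal the three-character string "txt", and every file is skipped. Slice the last three characters with x[-3:] or use x.endswith(".txt") instead.
Also make sure the directory actually contains the files you expect. The line in question is: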
annotations = [os.path.join('C:/Users/X3/1/text_files', x) for x in os.listdir('C:/Users/X3/1/text_files') if x[-3] == "txt"]
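A minimal sketch of the check and one possible fix. The helper name list_txt_files and the sample file name are hypothetical and not from the original post; matching the extension case-insensitively is an assumption.

```python
import os


def list_txt_files(directory):
    """Return the sorted full paths of the .txt files in `directory`."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        # endswith checks the whole suffix, unlike indexing one character
        if name.lower().endswith(".txt")
    )


# Why the original filter matched nothing:
name = "image_0001.txt"  # hypothetical file name
print(name[-3] == "txt")   # False: name[-3] is the single character "t"
print(name[-3:] == "txt")  # True: the slice is the last three characters
```

With this helper, the annotations list would be built as list_txt_files('C:/Users/X3/1/text_files').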
CodePudding user response:
Modified Code that worked:
import os
from sklearn.model_selection import train_test_split

# Read images and annotations
image_dir = r"C:/Users/X3/pharmaceutical-drugs-and-vitamins-synthetic-images/ImageClassesCombinedWithCOCOAnnotations/images_raw"
images = [os.path.join(image_dir, x) for x in os.listdir(image_dir)]
# Create a list of the file names of the images in the image_dir directory, without the full paths
image_filenames = [x for x in os.listdir(image_dir)]
# Create a list of the annotation paths that correspond to the images in the images list
annotations = []
for image_filename in image_filenames:
    # Swap the three-character image extension for "txt"
    annotation_filename = image_filename[:-3] + "txt"
    annotation_path = os.path.join('C:/Users/X3/text_files', annotation_filename)
    annotations.append(annotation_path)
# Sort the images and annotations lists
images.sort()
annotations.sort()
# Split the dataset into train-valid-test splits
train_images, val_images, train_annotations, val_annotations = train_test_split(images, annotations, test_size = 0.2, random_state = 1)
val_images, test_images, val_annotations, test_annotations = train_test_split(val_images, val_annotations, test_size = 0.5, random_state = 1)
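Note that the code above builds the annotation paths from the image names without checking that the files exist. A possible variant, sketched under the assumption that each annotation shares its image's base name with a .txt extension, pairs the two explicitly and skips images that have no annotation. The function name pair_images_with_annotations is hypothetical.

```python
import os


def pair_images_with_annotations(image_dir, annotation_dir):
    """Return (images, annotations) lists aligned index by index.

    Images without a matching .txt file are skipped, so both lists
    always have the same length.
    """
    images, annotations = [], []
    for name in sorted(os.listdir(image_dir)):
        # splitext handles extensions of any length (.jpg, .jpeg, .png)
        stem, _ext = os.path.splitext(name)
        annotation_path = os.path.join(annotation_dir, stem + ".txt")
        if os.path.isfile(annotation_path):
            images.append(os.path.join(image_dir, name))
            annotations.append(annotation_path)
    return images, annotations
```

The two returned lists can then be passed to train_test_split exactly as in the code above.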