ValueError: Found input variables with inconsistent numbers of samples: [10001, 0]

I am trying to split a dataset with sklearn's train_test_split, but I keep getting the error below. This is the notebook I am following: https://www.kaggle.com/code/vencerlanz09/pharmaceutical-drugs-classification-using-yolov5#✂️Splitting-the-Dataset

# Read images and annotations
image_dir = r"C:/Users/X3/pharmaceutical-drugs-and-vitamins-synthetic-images/ImageClassesCombinedWithCOCOAnnotations/images_raw"
images = [os.path.join(image_dir, x) for x in os.listdir(image_dir)]
annotations = [os.path.join('C:/Users/X3/1/text_files', x) for x in os.listdir('C:/Users/X3/1/text_files') if x[-3] == "txt"]

images.sort()
annotations.sort()

# Split the dataset into train-valid-test splits 
train_images, val_images, train_annotations, val_annotations = train_test_split(images, annotations, test_size = 0.2, random_state = 1)
val_images, test_images, val_annotations, test_annotations = train_test_split(val_images, val_annotations, test_size = 0.5, random_state = 1)

**The error I am getting is:**

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_6292\1715792042.py in <module>
      8 
      9 # Split the dataset into train-valid-test splits
---> 10 train_images, val_images, train_annotations, val_annotations = train_test_split(images, annotations, test_size = 1.0, random_state = 1)
     11 val_images, test_images, val_annotations, test_annotations = train_test_split(val_images, val_annotations, test_size = 0.5, random_state = 1)

~\anaconda3\lib\site-packages\sklearn\model_selection\_split.py in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
   2415         raise ValueError("At least one array required as input")
   2416 
-> 2417     arrays = indexable(*arrays)
   2418 
   2419     n_samples = _num_samples(arrays[0])

~\anaconda3\lib\site-packages\sklearn\utils\validation.py in indexable(*iterables)
    376 
    377     result = [_make_indexable(X) for X in iterables]
--> 378     check_consistent_length(*result)
    379     return result
    380 

~\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    330     uniques = np.unique(lengths)
    331     if len(uniques) > 1:
--> 332         raise ValueError(
    333             "Found input variables with inconsistent numbers of samples: %r"
    334             % [int(l) for l in lengths]

ValueError: Found input variables with inconsistent numbers of samples: [10001, 0]

CodePudding user response：

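This error means train_test_split received 10001 images and 0 annotations, so your annotations list is empty.

The filter x[-3] == "txt" compares a single character with a three-character string, so it is never true and every file is dropped. Use x[-3:] == "txt" or x.endswith(".txt") instead, and make sure the directory actually contains the annotation files, in the line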

annotations = [os.path.join('C:/Users/X3/1/text_files', x) for x in os.listdir('C:/Users/X3/1/text_files') if x[-3] == "txt"]
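To see why the list ends up empty, here is a minimal sketch of the three filters side by side. The file names are made up for illustration and are not taken from the dataset.

```python
# Hypothetical file names, used only to illustrate the filter.
filenames = ["img_0001.txt", "img_0002.txt", "notes.md"]

# Original filter: x[-3] is a single character (e.g. "t"),
# so it can never equal the three-character string "txt".
broken = [x for x in filenames if x[-3] == "txt"]

# Slicing the last three characters works ...
sliced = [x for x in filenames if x[-3:] == "txt"]

# ... and str.endswith states the intent more directly.
by_suffix = [x for x in filenames if x.endswith(".txt")]

print(broken)     # []
print(sliced)
print(by_suffix)
```

With the original filter every file is rejected, which matches the 0 reported in the error message.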

CodePudding user response：

Modified code that worked:

import os
from sklearn.model_selection import train_test_split

# Read images and annotations
image_dir = r"C:/Users/X3/pharmaceutical-drugs-and-vitamins-synthetic-images/ImageClassesCombinedWithCOCOAnnotations/images_raw"
images = [os.path.join(image_dir, x) for x in os.listdir(image_dir)]

# Create a list of the file names of the images in the image_dir directory, without the full paths
image_filenames = [x for x in os.listdir(image_dir)]

# Create a list of the annotation paths that correspond to the images in the images list
annotations = []
for image_filename in image_filenames:
    annotation_filename = image_filename[:-3] + "txt"
    annotation_path = os.path.join('C:/Users/X3/text_files', annotation_filename)
    annotations.append(annotation_path)

# Sort the images and annotations lists
images.sort()
annotations.sort()

# Split the dataset into train-valid-test splits 
train_images, val_images, train_annotations, val_annotations = train_test_split(images, annotations, test_size = 0.2, random_state = 1)
val_images, test_images, val_annotations, test_annotations = train_test_split(val_images, val_annotations, test_size = 0.5, random_state = 1)
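One caveat with the code above: image_filename[:-3] assumes every image has a three-letter extension, and the loop adds an annotation path even when no such file exists. A possible variation is sketched below, assuming each annotation shares its image's base name. The helper name pair_images_with_annotations and the demo file names are hypothetical, not part of the original notebook.

```python
import os
import tempfile


def pair_images_with_annotations(image_dir, annotation_dir):
    """Return (images, annotations) as parallel lists of paths.

    Images without a matching .txt file are skipped, so both
    lists always have the same length and the same order.
    """
    images, annotations = [], []
    for name in sorted(os.listdir(image_dir)):
        # splitext handles extensions of any length (.jpg, .jpeg, .png, ...)
        stem, _ext = os.path.splitext(name)
        annotation_path = os.path.join(annotation_dir, stem + ".txt")
        if os.path.isfile(annotation_path):
            images.append(os.path.join(image_dir, name))
            annotations.append(annotation_path)
    return images, annotations


# Demo on throwaway directories with made-up file names.
with tempfile.TemporaryDirectory() as img_dir, tempfile.TemporaryDirectory() as ann_dir:
    for name in ["a.jpg", "b.jpeg", "c.png"]:
        open(os.path.join(img_dir, name), "w").close()
    for name in ["a.txt", "b.txt"]:  # no annotation for c.png
        open(os.path.join(ann_dir, name), "w").close()

    images, annotations = pair_images_with_annotations(img_dir, ann_dir)
    image_names = [os.path.basename(p) for p in images]
    annotation_names = [os.path.basename(p) for p in annotations]

print(image_names)
print(annotation_names)
```

Because the two lists are built pair by pair, they can be passed to train_test_split directly without sorting them separately.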