Save image-to-text (OCR) results and check for duplicates one by one

Time:11-09

I'm using Python OCR to convert images to text, and I want to check the results for duplicates. I'm checking them one by one so that I can locate each duplicate more easily.

Example: listA = [1, 2, 3, 4, 4, 5, 6]
When I append to listA, it should show that 4 is a duplicate.

Main issue: my list "listOfElems" stays empty. I want to save each text result and detect, one by one, whether it is already in the list.

from PIL import Image
import pytesseract
import cv2 
import numpy as np
from os import listdir
from os.path import isfile, join

mypath = "/home/DC_ton/desktop/test_11_8/output02"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
print(onlyfiles)

i = 1
listOfElems = []
Number_of_onlyfiles = len(onlyfiles)
while i < Number_of_onlyfiles :
    each_file_path = '/home/DC_ton/desktop/test_11_8/output02/' + onlyfiles[i]
    image = Image.open(each_file_path)
    text = pytesseract.image_to_string(image, lang='eng')
    print(text)
       
    
    for text in listOfElems:
        if text not in listOfElems:
            listOfElems.append(text)
        else:
            print("here get duplicate")
     
    i += 1
    
print(listOfElems)  

newlist = [] 
duplist = []

def checkIfDuplicates_1(listOfElems):
    ''' Check if given list contains any duplicates '''
    if len(listOfElems) == len(set(listOfElems)):
        return False
    else:
        return True
    
result = checkIfDuplicates_1(listOfElems)
if result:
    print('Yes, list contains duplicates')
else:
    print('No duplicates found in list')  


for k in listOfElems:
    if k not in newlist:
        newlist.append(k)
    else:
        duplist.append(k) 
print("List of duplicates", duplist)
Output: my list "listOfElems" is empty, and I want to compare the results one by one
['final_output_11.png', 'final_output_6.png', 'final_output_17.png', 'final_output_8.png', 'final_output_15.png', 'final_output_14.png', 'final_output_2.png', 'final_output_12.png', 'final_output_21.png', 'final_output_3.png', 'final_output_24.png', 'final_output_18.png', 'final_output_19.png', 'final_output_10.png', 'final_output_29.png', 'final_output_9.png', 'final_output_20.png', 'final_output_7.png', 'final_output_31.png', 'final_output_30.png', 'final_output_25.png', 'final_output_1.png', 'final_output_16.png', 'final_output_5.png', 'final_output_27.png', 'final_output_13.png', 'final_output_28.png', 'final_output_4.png', 'final_output_23.png', 'final_output_26.png', 'final_output_22.png']
CA7T4B2


CAT7T4BF


CAT4B8


CAT4BE


CAT4C4







CAT4C1


CAT4B7


CA7T4CB


 


CAT4cs


CAT4B4


CAT4BA


CAT7T4BC


CA74B9


CAT4BD


(CAT4AF


CAT4CA


[]
No duplicates found in list
List of duplicates []

Image link (I can check the entire set for duplicates; I just don't know how to do it one by one):
https://imgur.com/a/RGUumoy

I also found a discussion of a similar case, but I could not adapt it to mine, so I still need a hand: How to get Array one by one Randomly in array order in Python

CodePudding user response:

You create an empty list, never add anything to it, and then iterate over it, so the loop body never runs.

i = 1
listOfElems = [] # <- empty
Number_of_onlyfiles = len(onlyfiles)
while i < Number_of_onlyfiles :
    each_file_path = '/home/DC_ton/desktop/test_11_8/output02/' + onlyfiles[i]
    image = Image.open(each_file_path)
    text = pytesseract.image_to_string(image, lang='eng')
    print(text)
   

    for text in listOfElems: # <- still empty
        if text not in listOfElems:
            listOfElems.append(text)
        else:
            print("here get duplicate")
 
    i += 1
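
To see why the list stays empty, here is a minimal sketch that isolates the inner loop. The names elems and iterations are made up for illustration and are not part of the original script.

```python
elems = []          # plays the role of listOfElems
iterations = 0      # counts how often the loop body runs

for value in elems:             # nothing to iterate over, so the body is skipped
    iterations += 1
    if value not in elems:      # never reached while elems is empty
        elems.append(value)

print(iterations, elems)        # prints: 0 []
```

Note also that in the original loop, "for text in listOfElems" reuses the name text, so even with a non-empty list the OCR result would be overwritten by the loop variable.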

An easy solution is to add the current element to the list if it isn't in there already, like so:

while i < Number_of_onlyfiles :
    each_file_path = '/home/DC_ton/desktop/test_11_8/output02/' + onlyfiles[i]
    image = Image.open(each_file_path)
    text = pytesseract.image_to_string(image, lang='eng')
    print(text)
    if text not in listOfElems:
        listOfElems.append(text)
    else:
        print("Duplicate")
    i += 1  # move on to the next file, otherwise the loop never ends

Also note that indexes start at 0, so i should be 0 at the beginning. And you don't have to iterate over a list to check whether an element is in it; just use the "in" operator.
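
As a rough sketch of the one-by-one check with the "in" operator, applied to the listA example from the question: the function name report_duplicates is hypothetical, and using a set for the already-seen values is my own choice (membership tests on a set stay fast as the collection grows), not something the original code does.

```python
def report_duplicates(items):
    """Return the items that already appeared earlier, in encounter order."""
    seen = set()        # values encountered so far
    duplicates = []     # values that showed up again
    for position, item in enumerate(items):
        if item in seen:
            print(f"duplicate at index {position}: {item!r}")
            duplicates.append(item)
        else:
            seen.add(item)
    return duplicates


list_a = [1, 2, 3, 4, 4, 5, 6]
dups = report_duplicates(list_a)   # prints: duplicate at index 4: 4
```

The same function works on OCR strings, since strings can be stored in a set just like numbers.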

You could also save a couple of lines by iterating over onlyfiles:

for file in onlyfiles:
    file_path = join(mypath, file)  # join adds the path separator that mypath lacks
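
Putting the pieces together, one possible sketch of the whole loop follows. It makes several assumptions: find_duplicate_texts and its read_text parameter are hypothetical names, the OCR step is passed in as a function so that the duplicate logic can be tried without Tesseract installed, and the text is stripped before comparing because the printed output above contains trailing newlines and blank results.

```python
from os import listdir
from os.path import isfile, join


def find_duplicate_texts(folder, read_text):
    """Map each repeated text to the file names that produced it.

    read_text is a hypothetical callable: it takes a file path and
    returns the recognized text as a string.
    """
    first_seen = {}   # text -> name of the first file containing it
    duplicates = {}   # text -> all file names sharing that text
    for name in sorted(listdir(folder)):
        path = join(folder, name)
        if not isfile(path):
            continue
        text = read_text(path).strip()   # drop surrounding whitespace and newlines
        if not text:
            continue                     # skip images where nothing was recognized
        if text in first_seen:
            duplicates.setdefault(text, [first_seen[text]]).append(name)
            print(f"{name}: duplicate of {first_seen[text]} ({text!r})")
        else:
            first_seen[text] = name
    return duplicates
```

With the question's setup, read_text could be lambda p: pytesseract.image_to_string(Image.open(p), lang='eng') and folder would be mypath.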