How to update dataframe using for loop with tesseract output of selected areas for images in a folder

Time:11-10

roi = [[(284, 764), (996, 840), 'text', 'name'],
       [(1560, 756), (2312, 836), 'text', 'cnic'],
       [(2000, 704), (2060, 748), 'box', 'corporate'],
       [(2296, 696), (2360, 756), 'box', 'individual'],
       [(1220, 844), (2360, 920), 'text', 'email']]

Above are the selected regions. For a 'text' region I run tesseract; for a 'box' region I get '0' or '1'. I want to save the results to a dataframe that can then be written to Excel, with the column headers taken from the last element of each 'roi' entry and the values taken from the tesseract output and the box values (1 or 0).

myPicList = os.listdir(sof_folder)
for j, y in enumerate(myPicList):
    if 'SOF' in y:
        img = cv.imread(sof_folder + "/" + y)

        df = pd.DataFrame()
        pixelThreshold = 1100
        for x, r in enumerate(roi):
            section = img[r[0][1]:r[1][1], r[0][0]:r[1][0]]
            if len(df.columns) < len(roi):

                if r[2] == 'text':
                    df[r[3]] = tess.image_to_string(section)

                if r[2] == 'box':
                    imgGray = cv.cvtColor(section, cv.COLOR_BGR2GRAY)
                    imgThresh = cv.threshold(imgGray, 170, 255, cv.THRESH_BINARY_INV)[1]
                    totalPixels = cv.countNonZero(imgThresh)
                    if totalPixels > pixelThreshold: totalPixels = 1;
                    else: totalPixels = 0
                    df[r[3]] = totalPixels
        
df.to_excel('forms saved.xlsx')

However, it only returns the column names (i.e., name, cnic, email etc).

A shorter version of the code, for easier reading:

for x, r in enumerate(roi):
    section = img[r[0][1]:r[1][1], r[0][0]:r[1][0]]
    if r[2] == 'text':
        df[r[3]] = tess.image_to_string(section)

I tried both solutions from here, but neither worked for me. The second solution simply does not work, and the first gives strange output: each row contains only one tesseract output, and only the last image's output is retained.

My code edit for this is as follows:

d1 = {}
d = {}
results = []
df = pd.DataFrame(data=d1)
for x, r in enumerate(roi):
    section = img[r[0][1]:r[1][1], r[0][0]:r[1][0]]
    if len(df.columns) < len(roi):
        
        if r[2] == 'text':
            # df[r[3]] = tess.image_to_string(section)
            readings = tess.image_to_string(section)
            d = {r[3]: [readings]}
            df = pd.DataFrame(data=d)
            results.append(df)

        if r[2] == 'box':
            imgGray = cv.cvtColor(section, cv.COLOR_BGR2GRAY)
            imgThresh = cv.threshold(imgGray, 170, 255, cv.THRESH_BINARY_INV)[1]
            totalPixels = cv.countNonZero(imgThresh)
            if totalPixels > pixelThreshold: totalPixels = 1;
            else: totalPixels = 0
            df[r[3]] = totalPixels
            d = {r[3]: [totalPixels]}
            df = pd.DataFrame(data=d)
            results.append(df)
final_df = pd.concat(results, axis=0)
final_df.to_csv("final.csv")

CodePudding user response:
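
A likely reason the original code writes only the headers, shown as a minimal sketch (assuming a recent pandas; the sample values are made up): assigning a scalar to a column of an empty DataFrame creates the column but no rows, because there is no index to broadcast over. Creating `df` inside the image loop also resets it for every file.

```python
import pandas as pd

# Scalar assignment on an empty frame: the columns appear, but no rows do.
empty_df = pd.DataFrame()
empty_df['name'] = 'sample text'   # made-up value standing in for OCR output
empty_df['corporate'] = 1          # made-up checkbox value

# Building the frame from a list of dicts gives one row per dict.
one_row_df = pd.DataFrame([{'name': 'sample text', 'corporate': 1}])

print(empty_df.shape)    # (0, 2)
print(one_row_df.shape)  # (1, 2)
```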

df = pd.DataFrame()

for j, y in enumerate(myPicList):
    if 'SOF' in y:
        with open('dataOutput.csv', 'a+') as f:
            f.write(y + ',')
        img = cv.imread(sof_folder + "/" + y)
        pixelThreshold = 1100
        myData = []
        for x, r in enumerate(roi):
            section = img[r[0][1]:r[1][1], r[0][0]:r[1][0]]
            if len(df.columns) < len(roi):

                if r[2] == 'text':
                    text = tess.image_to_string(section)
                    text = text.replace("\n", " ")
                    myData.append(text)

                if r[2] == 'box':
                    imgGray = cv.cvtColor(section, cv.COLOR_BGR2GRAY)
                    imgThresh = cv.threshold(imgGray, 170, 255, cv.THRESH_BINARY_INV)[1]
                    totalPixels = cv.countNonZero(imgThresh)
                    if totalPixels > pixelThreshold: totalPixels = 1;
                    else: totalPixels = 0
                    myData.append(totalPixels)
                    
        with open('dataOutput.csv', 'a+') as f:
            for data in myData:
                f.write(str(data) + ',')
            f.write('\n')

Not exactly stored in a dataframe but I hope this works for you.
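
If the values do need to end up in a dataframe, one option is to collect one dict per image and build the DataFrame once at the end. This is only a sketch: `build_row`, `to_dataframe`, `fake_read_field`, and the file names are hypothetical, and `fake_read_field` stands in for the cropping, tesseract, and pixel-count code above.

```python
# Shortened copy of the question's roi list, for illustration only.
roi = [[(284, 764), (996, 840), 'text', 'name'],
       [(2000, 704), (2060, 748), 'box', 'corporate']]


def build_row(filename, roi, read_field):
    """Return one dict per image, keyed by the column names in roi."""
    row = {'file': filename}
    for top_left, bottom_right, kind, column in roi:
        # read_field is a hypothetical callable: it would crop the region
        # and run tesseract ('text') or the pixel count ('box').
        row[column] = read_field(kind, top_left, bottom_right)
    return row


def to_dataframe(rows):
    """Build the DataFrame in one step; each dict becomes one row."""
    # Imported lazily so the row-building code can be tested without pandas.
    import pandas as pd
    return pd.DataFrame(rows)


def fake_read_field(kind, top_left, bottom_right):
    # Placeholder values; real code would read them from the image.
    return 'sample text' if kind == 'text' else 1


# One row per (hypothetical) image file.
rows = [build_row(name, roi, fake_read_field)
        for name in ['SOF_1.png', 'SOF_2.png']]
```

With real readers plugged in, `to_dataframe(rows).to_excel('forms saved.xlsx')` would then write one row per image.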
