Using OCR to convert a spreadsheet image to data in R

Time:08-31

I'm trying to convert the data in an image to a data frame in R using tesseract, but I have run into a problem, perhaps due to my use of regular expressions.

library(magick)
library(tesseract)

team_img <- image_read("measuring.png")

 team_mgk <- team_img %>%
  image_resize('2000x') %>%
  image_convert(type = 'Grayscale') %>%
  image_trim(fuzz = 40) %>%
  image_write(format = 'png', density = '300x300') %>%
  tesseract::ocr()
 
 cat(team_mgk)
 
 text1_a <- gsub('[[:punct:]]', '', team_mgk)

 read.table(text=team_mgk, 
            col.names=c('Time_factor', 'Tree_#', 'Species', 
                        'Fragment','Linear_Extension', 'Colur'))

 # Error
 # Error in scan(file = file, what = what, sep = sep, quote = quote, 
 # dec = dec,  : 
 # line 1 did not have 6 elements

The idea is to learn how to use OCR to read in a data frame. The image is as follows:

[image: https://i.stack.imgur.com/V9lWV.png]

Answer:

You were very close. The OCR did a good job, with only a single underscore missing in the last row, which causes read.table to throw an error. This can be fixed with a simple replacement using sub.
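To see why one missing underscore is enough to break read.table, here is a minimal sketch that needs no image. The lines in `ocr_lines` are a hypothetical stand-in for the OCR text, not the real tesseract output, and `n_fields`, `fixed` and `demo` are names made up for this illustration. It also suggests why the header line is best dropped: its names contain spaces, and read.table treats `#` as a comment character by default.

```r
# Hypothetical stand-in for the OCR text (not the real tesseract output)
ocr_lines <- c(
  "Time_factor Tree # Species Fragment Linear Extension(mm) Colour",
  "Time_1 34 A.tenius 4D 80.32 Blue",
  "Time 2 34 A.tenius 4D 101.75 Blue"  # underscore lost by the OCR
)

# Count whitespace-separated fields per line, as read.table would see them
con <- textConnection(ocr_lines)
n_fields <- count.fields(con)
close(con)
n_fields  # only the middle line has the expected 6 fields

# Drop the header, restore the underscore, then parse
fixed <- sub("Time 2", "Time_2", ocr_lines[-1], fixed = TRUE)
demo <- read.table(text = fixed)
str(demo)
```

Running count.fields on the raw OCR lines before calling read.table is a quick way to find which rows need repair.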

Since the image is online now at https://i.stack.imgur.com/V9lWV.png after being uploaded in your question, we can create a fully reproducible example.

library(dplyr)
library(magick)
library(tesseract)

df <- "https://i.stack.imgur.com/V9lWV.png" %>%
  image_read() %>%
  image_resize('2000x') %>%
  image_convert(type = 'Grayscale') %>%
  image_trim(fuzz = 40) %>%
  image_write(format = 'png', density = '300x300') %>%
  tesseract::ocr() %>%
  strsplit('\n') %>%
  getElement(1) %>%
  `[`(-1) %>%
  {sub('Time 2', 'Time_2', .)} %>%
  {read.table(text = .)} %>%
  setNames(c('Time_factor', 'Tree #', 'Species', 'Fragment', 
             'Linear Extension(mm)', 'Colour'))

Resulting in

df
#>    Time_factor Tree #  Species Fragment Linear Extension(mm) Colour
#> 1       Time_O     31 A.tenius      12A                49.50  Brown
#> 2       Time_1     31 A.tenius      12A                56.72  Brown
#> 3       Time_2     31 A.tenius      12A                74.38  Brown
#> 4       Time_O     31 A.tenius      12B                58.66  Brown
#> 5       Time_1     31 A.tenius      12B                78.45  Brown
#> 6       Time_2     31 A.tenius      12B                94.37  Brown
#> 7       Time_O     31 A.tenius      12C                55.97  Brown
#> 8       Time_1     31 A.tenius      12C                90.12  Brown
#> 9       Time_2     31 A.tenius      12C               121.61  Brown
#> 10      Time_O     31 A.tenius      12D                70.19  Brown
#> 11      Time_1     31 A.tenius      12D                91.82  Brown
#> 12      Time_2     31 A.tenius      12D               115.57  Brown
#> 13      Time_O     34 A.tenius       3B                60.10 Yellow
#> 14      Time_1     34 A.tenius       3B                79.00 Yellow
#> 15      Time_2     34 A.tenius       3B               103.82 Yellow
#> 16      Time_O     34 A.tenius       3C                48.18 Yellow
#> 17      Time_1     34 A.tenius       3C                58.70 Yellow
#> 18      Time_2     34 A.tenius       3C                99.03 Yellow
#> 19      Time_O     34 A.tenius       3D                66.12 Yellow
#> 20      Time_1     34 A.tenius       3D                84.05 Yellow
#> 21      Time_2     34 A.tenius       3D               114.38 Yellow
#> 22      Time_O     34 A.tenius       3E                68.94 Yellow
#> 23      Time_1     34 A.tenius       3E                92.30 Yellow
#> 24      Time_2     34 A.tenius       3E               109.05 Yellow
#> 25      Time_O     34 A.tenius       4A                46.20   Blue
#> 26      Time_1     34 A.tenius       4A                67.00   Blue
#> 27      Time_2     34 A.tenius       4A               127.48   Blue
#> 28      Time_O     34 A.tenius       4B                87.19   Blue
#> 29      Time_1     34 A.tenius       4B               109.18   Blue
#> 30      Time_2     34 A.tenius       4B               109.71   Blue
#> 31      Time_O     34 A.tenius       4C                77.26   Blue
#> 32      Time_1     34 A.tenius       4C               123.57   Blue
#> 33      Time_2     34 A.tenius       4C               135.59   Blue
#> 34      Time_O     34 A.tenius       4D                60.01   Blue
#> 35      Time_1     34 A.tenius       4D                80.32   Blue
#> 36      Time_2     34 A.tenius       4D               101.75   Blue

Created on 2022-08-30 with reprex v2.0.2
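One detail in the printed result may be worth a second look: the first time point appears as `Time_O` with a capital letter O. This looks like the OCR reading a zero as a letter, although that is an assumption based on the `Time_1` and `Time_2` labels rather than something confirmed from the image. If so, a second replacement along these lines could tidy it up. The small `df_demo` data frame is a hypothetical excerpt built from the first three rows above so that the sketch runs on its own.

```r
# Hypothetical excerpt of the result (first three rows of the output above)
df_demo <- data.frame(
  Time_factor = c("Time_O", "Time_1", "Time_2"),
  `Linear Extension(mm)` = c(49.50, 56.72, 74.38),
  check.names = FALSE,       # keep the space and brackets in the column name
  stringsAsFactors = FALSE
)

# Replace a trailing capital O with a zero (assumes the label should be Time_0)
df_demo$Time_factor <- sub("_O$", "_0", df_demo$Time_factor)

df_demo$Time_factor
```

The same `sub()` call should work on the full `df`, since it only touches values ending in `_O`.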
