I'm trying to convert the data in an image to a data frame in R using tesseract, but I have run into a problem, perhaps due to my use of regular expressions.
library(magick)
library(tesseract)
team_img <- image_read("measuring.png")
team_mgk <- team_img %>%
  image_resize('2000x') %>%
  image_convert(type = 'Grayscale') %>%
  image_trim(fuzz = 40) %>%
  image_write(format = 'png', density = '300x300') %>%
  tesseract::ocr()
cat(team_mgk)
text1_a <- gsub('[[:punct:]]', '', team_mgk)
read.table(text=team_mgk,
           col.names=c('Time_factor', 'Tree_#', 'Species',
                       'Fragment','Linear_Extension', 'Colur'))
# Error
# Error in scan(file = file, what = what, sep = sep, quote = quote,
# dec = dec, :
# line 1 did not have 6 elements
The idea is to learn how to use OCR to read in a data frame. The image is as follows:
Answer:
You were very close. The OCR did a good job, with only a single underscore missing in the last row, which causes read.table to throw an error. This can be fixed with a simple replacement using sub.
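To see why one missing underscore matters, here is a minimal sketch that needs no OCR at all. The three lines in `ocr_lines` are made-up stand-ins for OCR output (an assumption for illustration, not the actual text returned by tesseract). The space in "Time 2" turns one field into two, so that line no longer has the same number of whitespace-separated fields as the others.

```r
# Made-up stand-ins for OCR output; the last line has "Time 2"
# where "Time_2" was expected.
ocr_lines <- c(
  "Time_0 31 A.tenius 12A 49.50 Brown",
  "Time_1 31 A.tenius 12A 56.72 Brown",
  "Time 2 31 A.tenius 12A 74.38 Brown"
)

# Number of space-separated fields per line before the fix.
fields_before <- lengths(strsplit(ocr_lines, " +"))

# sub() replaces the first match in each element, restoring the underscore.
fixed_lines <- sub("Time 2", "Time_2", ocr_lines)
fields_after <- lengths(strsplit(fixed_lines, " +"))

# With a consistent field count, read.table() can parse the lines.
demo_df <- read.table(text = fixed_lines)
```

Comparing `fields_before` with `fields_after` shows which line was malformed, which can be a quick check before calling read.table on real OCR output.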
Since the image is now online at https://i.stack.imgur.com/V9lWV.png after being uploaded with your question, we can create a fully reproducible example.
library(dplyr)      # provides the %>% pipe
library(magick)     # image reading and preprocessing
library(tesseract)  # OCR
df <- "https://i.stack.imgur.com/V9lWV.png" %>%
  image_read() %>%
  image_resize('2000x') %>%
  image_convert(type = 'Grayscale') %>%
  image_trim(fuzz = 40) %>%
  image_write(format = 'png', density = '300x300') %>%
  tesseract::ocr() %>%
  strsplit('\n') %>%                  # one element per line of text
  getElement(1) %>%
  `[`(-1) %>%                         # drop the first line (the header row)
  {sub('Time 2', 'Time_2', .)} %>%    # restore the missing underscore
  {read.table(text = .)} %>%
  setNames(c('Time_factor', 'Tree #', 'Species', 'Fragment',
             'Linear Extension(mm)', 'Colour'))
Resulting in
df
#> Time_factor Tree # Species Fragment Linear Extension(mm) Colour
#> 1 Time_O 31 A.tenius 12A 49.50 Brown
#> 2 Time_1 31 A.tenius 12A 56.72 Brown
#> 3 Time_2 31 A.tenius 12A 74.38 Brown
#> 4 Time_O 31 A.tenius 12B 58.66 Brown
#> 5 Time_1 31 A.tenius 12B 78.45 Brown
#> 6 Time_2 31 A.tenius 12B 94.37 Brown
#> 7 Time_O 31 A.tenius 12C 55.97 Brown
#> 8 Time_1 31 A.tenius 12C 90.12 Brown
#> 9 Time_2 31 A.tenius 12C 121.61 Brown
#> 10 Time_O 31 A.tenius 12D 70.19 Brown
#> 11 Time_1 31 A.tenius 12D 91.82 Brown
#> 12 Time_2 31 A.tenius 12D 115.57 Brown
#> 13 Time_O 34 A.tenius 3B 60.10 Yellow
#> 14 Time_1 34 A.tenius 3B 79.00 Yellow
#> 15 Time_2 34 A.tenius 3B 103.82 Yellow
#> 16 Time_O 34 A.tenius 3C 48.18 Yellow
#> 17 Time_1 34 A.tenius 3C 58.70 Yellow
#> 18 Time_2 34 A.tenius 3C 99.03 Yellow
#> 19 Time_O 34 A.tenius 3D 66.12 Yellow
#> 20 Time_1 34 A.tenius 3D 84.05 Yellow
#> 21 Time_2 34 A.tenius 3D 114.38 Yellow
#> 22 Time_O 34 A.tenius 3E 68.94 Yellow
#> 23 Time_1 34 A.tenius 3E 92.30 Yellow
#> 24 Time_2 34 A.tenius 3E 109.05 Yellow
#> 25 Time_O 34 A.tenius 4A 46.20 Blue
#> 26 Time_1 34 A.tenius 4A 67.00 Blue
#> 27 Time_2 34 A.tenius 4A 127.48 Blue
#> 28 Time_O 34 A.tenius 4B 87.19 Blue
#> 29 Time_1 34 A.tenius 4B 109.18 Blue
#> 30 Time_2 34 A.tenius 4B 109.71 Blue
#> 31 Time_O 34 A.tenius 4C 77.26 Blue
#> 32 Time_1 34 A.tenius 4C 123.57 Blue
#> 33 Time_2 34 A.tenius 4C 135.59 Blue
#> 34 Time_O 34 A.tenius 4D 60.01 Blue
#> 35 Time_1 34 A.tenius 4D 80.32 Blue
#> 36 Time_2 34 A.tenius 4D 101.75 Blue
Created on 2022-08-30 with reprex v2.0.2
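One optional follow-up: in the printed result the first time level appears as Time_O, with a capital letter O. If that is an OCR misread of the digit zero (an assumption, since only the original image can confirm it), it can be normalised after parsing. The sketch below uses a small stand-in data frame, `df_demo`, with a few values copied from the output above; the column name `Extension` is a shortened, hypothetical name used only here.

```r
# Stand-in for the first rows of the result; not the full data set.
df_demo <- data.frame(
  Time_factor = c("Time_O", "Time_1", "Time_2"),
  Fragment    = c("12A", "12A", "12A"),
  Extension   = c(49.50, 56.72, 74.38),
  stringsAsFactors = FALSE
)

# Assumption: a trailing capital "O" is a misread digit zero.
df_demo$Time_factor <- sub("O$", "0", df_demo$Time_factor)

# Inspect column types, e.g. to confirm the measurements are numeric.
str(df_demo)
```

The same `sub()` call could be applied to the `Time_factor` column of the real result, provided the assumption about the misread character holds for your image.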