Is there a way to read and collect an EMF image file in Python? Can we read an EMF image with OpenCV?

Time:05-10

I have been searching for a solution for a long time but have not been able to find one. There are several similar questions and answers, but they did not help me.

Basically:

  1. I have some Word documents (xxx.docx) that contain images.
  2. The image is in WMF format (as far as I can tell from checking it manually) and it basically contains tabular information.
  3. I need to collect that table.
    So the task reduces to extracting the image and recovering the table from it using computer vision.
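
For the first step, it may help that a .docx file is a ZIP archive in which embedded pictures are normally stored under word/media/. A minimal sketch using only the Java standard library follows; DocxMediaSketch and readMedia are hypothetical names chosen for illustration, and the input path is taken from the command line.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration; not part of any library.
class DocxMediaSketch {

    // Reads every ZIP entry under word/media/ into memory, keyed by entry name.
    static Map<String, byte[]> readMedia(InputStream docx) throws IOException {
        Map<String, byte[]> media = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith("word/media/")) {
                    continue; // skip document XML, styles, relationships, ...
                }
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                // read() returns -1 once the current entry is exhausted
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                media.put(name, buf.toByteArray());
            }
        }
        return media;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a .docx file
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            readMedia(in).forEach((name, bytes) ->
                System.out.println(name + " (" + bytes.length + " bytes)"));
        }
    }
}
```

The entry names show which pictures are .wmf or .emf, and the bytes can be written to disk or handed to a metafile parser.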


I think it got slightly messed up on my system because I lack your fonts.

CodePudding user response:

I don't know much about Python, but I've implemented the WMF/EMF/EMF+ classes in Apache POI. I would use the location of the text records to give them some meaning. The rest is for you to figure out, e.g. by only using lines with the same number of columns.

import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.poi.hemf.usermodel.HemfPicture;
import org.apache.poi.hwmf.record.HwmfText.WmfExtTextOut;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.junit.jupiter.api.Test;

public class TestWmfExtract {
    @Test
    void blub() throws IOException {
        Map<Double, Map<Double,String>> tab = new TreeMap<>();

        try (InputStream is = new FileInputStream("36C77022Q0250.docx");
             XWPFDocument doc = new XWPFDocument(is);
             InputStream is2 = doc.getAllPictures().get(0).getPackagePart().getInputStream()
        ) {
            HemfPicture emf = new HemfPicture(is2);

            // keep only the text-output records of the metafile
            Stream<WmfExtTextOut> st = emf.getRecords().stream()
                .filter(r -> r instanceof WmfExtTextOut)
                .map(WmfExtTextOut.class::cast);
            for (WmfExtTextOut hr : (Iterable<WmfExtTextOut>) (st::iterator)) {
                Point2D p2d = hr.getReference();
                String txt = hr.getText(StandardCharsets.UTF_16LE);
                Rectangle2D bi = (Rectangle2D)hr.getGenericProperties().get("boundsIgnored").get();
                // prefer the center of the bounds, fall back to the reference point
                double x = bi != null ? bi.getCenterX() : p2d.getX();
                // snap x to a 20-unit grid so cells of one column share a key
                x = 20. * Math.round(x / 20.);
                // outer key: y (row), inner key: snapped x (column)
                tab.computeIfAbsent(p2d.getY(), (d) -> new TreeMap<>()).put(x, txt);
            }

            List<Double> colX = tab.values().stream().flatMap((m) -> m.keySet().stream())
                .distinct().sorted().collect(Collectors.toList());

            try (Workbook wb = new XSSFWorkbook();
                 FileOutputStream fos = new FileOutputStream("tab-out.xlsx")) {
                Sheet sh = wb.createSheet();

                int rowIdx = 0;
                for (Map<Double, String> cols : tab.values()) {
                    Row row = sh.createRow(rowIdx);
                    cols.forEach((x, txt) -> row.createCell(colX.indexOf(x)).setCellValue(txt));
                    rowIdx++;
                }

                wb.write(fos);
            }
        }
    }
}
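
The filtering step that the answer leaves open, keeping only lines with the same number of columns, could look roughly like the sketch below. It is not part of the original answer: RowFilterSketch, snap, put and keepDominantRows are hypothetical names, the coordinates in main are made-up sample data, and it assumes the real table rows are the ones whose cell count occurs most often.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of Apache POI or the original answer.
class RowFilterSketch {

    // Snap x to the nearest multiple of grid, like the 20-unit rounding above.
    static double snap(double x, double grid) {
        return grid * Math.round(x / grid);
    }

    // Store one text cell under its row (y) and snapped column (x).
    static void put(Map<Double, Map<Double, String>> tab, double y, double x, String txt) {
        tab.computeIfAbsent(y, k -> new TreeMap<>()).put(snap(x, 20.0), txt);
    }

    // Assumption: table rows share the most frequent cell count;
    // rows with any other count (captions, stray labels) are dropped.
    static List<Map<Double, String>> keepDominantRows(Map<Double, Map<Double, String>> tab) {
        Map<Integer, Integer> freq = new HashMap<>();
        for (Map<Double, String> row : tab.values()) {
            freq.merge(row.size(), 1, Integer::sum);
        }
        int dominant = 0;
        int best = 0;
        for (Map.Entry<Integer, Integer> e : freq.entrySet()) {
            int size = e.getKey();
            int count = e.getValue();
            // ties go to the wider row, which is more likely to be the table
            if (count > best || (count == best && size > dominant)) {
                dominant = size;
                best = count;
            }
        }
        List<Map<Double, String>> kept = new ArrayList<>();
        for (Map<Double, String> row : tab.values()) {
            if (row.size() == dominant) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // made-up sample: two table rows and one single-cell caption
        Map<Double, Map<Double, String>> tab = new TreeMap<>();
        put(tab, 10.0, 3.0, "Item");
        put(tab, 10.0, 104.0, "Qty");
        put(tab, 30.0, 2.0, "Bolt");
        put(tab, 30.0, 98.0, "4");
        put(tab, 50.0, 40.0, "Note");
        keepDominantRows(tab).forEach(row -> System.out.println(row.values()));
    }
}
```

Picking the most frequent cell count is only one heuristic; a table with merged cells or multi-line cells would need a looser rule.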