Home > Net >  LibreOffice Calc REGEX to extract multiple choice question/answer strings
LibreOffice Calc REGEX to extract multiple choice question/answer strings

Time:08-05

This looks like an easy task but I'm not being able to achieve it with LibreOffice Calc REGEX formulas. I have a bunch of exams, each one with 20 multiple choice questions in this format:

1. Which of the following statements about producers is false?
a. Households produce many goods and services for themselves.
b. People set up some producers who do not aim to make profits.
c. All the goods and services consumed in any country are produced by its own producers.
d. Governments arrange the production of some goods and services.

Some questions have several paragraphs with new line characters in between. What I want to achieve is to capture/extract each one of these strings (question / answer a / answer b / answer c / answer d) using REGEX in LibreOffice. The idea is having one REGEX for each string to separate the info in different cells like this: enter image description here

Formula in B1:

=TRIM(FILTERXML("<t><s>"&REGEX(SUBSTITUTE(SUBSTITUTE($A$1,"&","&amp;"),"<","&lt;"),"(?<=\n|^)(?=\d \.|[a-d]\.)","</s><s>","g")&"</s></t>","//s[node()]["&COLUMN(A1) ROW(A1)*5-5&"]"))

Drag 5 columns right and x-amount rows down untill no more values.


Regex pattern: (?<=\n|^)(?=\d \.|[a-d]\.) means:

  • (?<=\n|^) - Positive lookbehind to assert position is preceded by newline character or start-string;
  • (?=\d \.|[a-d]\.) - Positive lookbehind to assert position is followed by 1 digits or character a-d and literal dot.

Xpath expression //s[node()]["&COLUMN(A1) ROW(A1)*5-5&"] means:

  • //s[node()] - Any non-empty nodes;
  • ["&COLUMN(A1) ROW(A1)*5-5&"] - Nested math to return the appropriate index from elements.
  • Related