Clean up badly formatted questionnaires using regex

I have a badly formatted questionnaire where the answers (and accompanying newlines) often appear somewhere in the middle of the questions. This breaks sentence segmentation (where a "sentence" is a question plus its corresponding answer), which makes it very difficult for a model to extract information from each Q&A.

Example:

\n01 Do you have preexisting      No\nconditions?\n02 Within the past 12 months I worried about          Never True\nmy health would get worse.\n03 Within the past 12 months I have had         Never True\nhigh blood pressure.\n04 What is your housing situation today?   I have housing\n05 How many times have you moved in the past 12        Zero (I did not move)\nmonths?\n06 Are you worried that in the next 2 months, you may not    No\nhave your own housing to live in?\n07 Do you have trouble paying your heating or electricity    No\nbill?\n08 Do you have trouble paying for medicines?                 No\n09 Are you currently unemployed and looking for work?        No\n10 Are you interested in more education?                     Yes\n\n

Print version of the example:

01 Do you have preexisting      No
conditions?
02 Within the past 12 months I worried about          Never True
my health would get worse.
03 Within the past 12 months I have had         Never True
high blood pressure.
04 What is your housing situation today?   I have housing
05 How many times have you moved in the past 12        Zero (I did not move)
months?
06 Are you worried that in the next 2 months, you may not    No
have your own housing to live in?
07 Do you have trouble paying your heating or electricity    No
bill?
08 Do you have trouble paying for medicines?                 No
09 Are you currently unemployed and looking for work?        No
10 Are you interested in more education?                     Yes

Expected output:

If the answer is located somewhere inside the question, move it to the end of the sentence;
Remove unnecessary whitespace and newlines in the question;
Replace the question mark or other punctuation at the end of the question with : so that the sentence segmentation model includes the answer after the : and before the next question.

Expected example output:

\n01 Do you have preexisting conditions: No\n02 Within the past 12 months I worried about my health would get worse: Never True\n03 Within the past 12 months I have had high blood pressure: Never True\n04 What is your housing situation today: I have housing\n05 How many times have you moved in the past 12 months: Zero (I did not move)\n06 Are you worried that in the next 2 months, you may not have your own housing to live in: No\n07 Do you have trouble paying your heating or electricity bill: No\n08 Do you have trouble paying for medicines: No\n09 Are you currently unemployed and looking for work: No\n10 Are you interested in more education: Yes\n\n

I have been trying to match consecutive \n(0[1-9]|1[0-3])s and use re.sub with lambda m: m.group(), but with no luck so far. Any advice is welcome!

CodePudding user response：

This is close, I believe:

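import re

# A question starts at a newline followed by a two-digit number and a space.
question_break_re = re.compile(r"\n(?=\d{2} )")
# The answer is the text between a run of 2+ whitespace characters and the end of that line.
answer_re = re.compile(r"\s{2,}([^\n]+)")
whitespace_re = re.compile(r"\s+")
# Optional "?" or "." at the very end of the question.
end_of_question_mark_re = re.compile(r"(?:\?|\.)?$")

def tidy_up_question(question):
    answer = None
    match = answer_re.search(question)
    if match:
        answer = match.group(1)
        # Cut the answer (and the whitespace before it) out of the question.
        question = question[:match.start(0)] + question[match.end(0):]
    # Collapse the remaining newlines and repeated spaces.
    question = whitespace_re.sub(' ', question).strip()
    if answer is not None:
        # count=1: the pattern can also match the empty string at the end,
        # which would otherwise append the answer a second time.
        question = end_of_question_mark_re.sub(f": {answer}", question, count=1)
    return question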


text = "\n01 Do you have preexisting      No\nconditions?\n02 Within the past 12 months I worried about          Never True\nmy health would get worse.\n03 Within the past 12 months I have had         Never True\nhigh blood pressure.\n04 What is your housing situation today?   I have housing\n05 How many times have you moved in the past 12        Zero (I did not move)\nmonths?\n06 Are you worried that in the next 2 months, you may not    No\nhave your own housing to live in?\n07 Do you have trouble paying your heating or electricity    No\nbill?\n08 Do you have trouble paying for medicines?                 No\n09 Are you currently unemployed and looking for work?        No\n10 Are you interested in more education?                     Yes\n\n"

q_and_a = [
    tidy_up_question(question)
    for question in question_break_re.split(text)
    if question.strip()
]

print('\n'.join(q_and_a))

Output:

01 Do you have preexisting conditions: No
02 Within the past 12 months I worried about my health would get worse: Never True
03 Within the past 12 months I have had high blood pressure: Never True
04 What is your housing situation today: I have housing
05 How many times have you moved in the past 12 months: Zero (I did not move)
06 Are you worried that in the next 2 months, you may not have your own housing to live in: No
07 Do you have trouble paying your heating or electricity bill: No
08 Do you have trouble paying for medicines: No
09 Are you currently unemployed and looking for work: No
10 Are you interested in more education: Yes

This will fail on some corner cases. For example, if the 12 in question 05 had wrapped to the start of the next line, it would have been treated as the start of a new question. Likewise, any run of two or more whitespace characters that does not immediately precede an answer will be mistaken for the gap before an answer.
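To make those two corner cases concrete, here is a small sketch that reuses the same two patterns on made-up input. The sample strings are hypothetical and not taken from the questionnaire.

```python
import re

# Same patterns as in the answer above.
question_break_re = re.compile(r"\n(?=\d{2} )")
answer_re = re.compile(r"\s{2,}([^\n]+)")

# Hypothetical input: the wrapped line of question 05 happens to start with "12 ".
text = "\n05 How many times have you moved in the past      Zero\n12 months?\n06 Do you rent?   No\n"
pieces = [p for p in question_break_re.split(text) if p.strip()]
print(pieces)  # "12 months?" shows up as a question of its own

# Hypothetical input: a stray double space inside the question text.
m = answer_re.search("08 Do you  have trouble paying?     No")
print(m.group(1))  # the "answer" swallows the rest of the question
```

In the first case the list has three entries instead of two; in the second, the captured group is everything after the stray double space rather than just "No".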

The method I used: split the string into questions, on the working theory that each one starts a line with a two-digit number; identify the answer as the text between a run of multiple whitespace characters and the next newline; finally, replace the end punctuation with a colon and the answer.
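The splitting step could be made somewhat stricter. The sketch below assumes (this is not stated in the question) that questions are numbered consecutively, and only accepts a line-leading number as a question start if it is the number expected next. The names `candidate_re` and `split_questions` are hypothetical and not part of the code above.

```python
import re

# A newline followed by a two-digit number and a space; the number is captured.
candidate_re = re.compile(r"\n(?=(\d{2}) )")

def split_questions(text):
    """Split text into questions, accepting only consecutively numbered starts."""
    pieces = []
    start = None      # index where the current question begins
    expected = None   # number the next question is assumed to carry
    for m in candidate_re.finditer(text):
        number = int(m.group(1))
        if expected is None or number == expected:
            if start is not None:
                pieces.append(text[start:m.start()])
            start = m.end()        # just after the newline
            expected = number + 1
        # Otherwise: treat the number as part of a wrapped question line.
    if start is not None:
        pieces.append(text[start:])
    return pieces

# Hypothetical input where a wrapped line starts with "12 ".
demo = "\n05 How many times have you moved in the past      Zero\n12 months?\n06 Do you rent?   No\n"
print(len(split_questions(demo)))  # 2
```

This still misfires if a wrapped line happens to start with exactly the next expected number, so it narrows the corner case rather than removing it. The resulting pieces could then be passed to tidy_up_question as before.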