I have the following task:
Use grep with the -Pao options and a regular expression to extract all phone numbers from the broken file (solution: 13 phone numbers). The regular expression should match the following phone number formats as closely as possible and be as short as possible:
I tried to start with the beginning of each number format, planning to combine the pieces and extend them from there.
I now have the following code:
grep -Pao '(\+\d{2}.) | (\d{3,4}) | (\d\s\d{2})' kaputt.txt
(the mode is PCRE)
Unfortunately, the code does not return the desired results; it seems that the search conditions are mutually exclusive. I would be grateful for any help.
CodePudding user response:
Are there blanks on both sides of the pipes? If so, they are part of the pattern: the first alternative is actually (\+\d{2}.)\s, which doesn't match any of the formats.
https://regex101.com/r/qDmGIC/1 - but it will also match some unwanted combinations like 111 (1)11 11
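To see why the blanks matter, here is a minimal sketch on a made-up input line (the `sample` string and the `(x)` alternative are assumptions for illustration, not taken from kaputt.txt). In PCRE a blank inside the pattern is a literal character, so each alternative only matches when that blank is present in the text as well:

```shell
# Hypothetical sample line, not taken from kaputt.txt
sample='call 1234 now'

# Blanks around the pipe belong to the alternatives: '(\d{3,4}) ' and ' (x)'
with_blanks=$(printf '%s\n' "$sample" | grep -Pao '(\d{3,4}) | (x)')

# Without blanks the alternatives are just '(\d{3,4})' and '(x)'
without_blanks=$(printf '%s\n' "$sample" | grep -Pao '(\d{3,4})|(x)')

# Brackets make the trailing blank of the first match visible
printf '[%s]\n[%s]\n' "$with_blanks" "$without_blanks"
```

The first match carries a trailing blank, the second is just the digits. A number at the very end of a line would be missed by the first pattern altogether.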
CodePudding user response:
It would be a fool's errand to try and find the absolute shortest regex possible. The following should be fine as no format seems to be an extension of another.
grep -Pao "(?:\+\d\d \d\d \d{7}|\+\d\d \(\d\d\) \d{5} \- \d\d|\+\d\d \(\d\)\d\d \d{5}\-\d\d|\+\d\d-\d\d\-\d{7}|\+\d\d \d\d \d{5}\-\d\d|\d{4} \d \d{6}|\d \d\d \/ \d\d \d\d \d\d|\d{8}\-\d\d)" kaputt.txt
It is just the text extracted from your image (!) of the required formats, with x replaced by \d, - replaced by \-, + replaced by \+, literal parentheses escaped as \( and \), and with each format alternative separated by |.
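As a rough sketch of how such an alternation behaves, the snippet below runs three of the alternatives against made-up data. The `sample` text and the variable names are assumptions for illustration, and it assumes the international formats start with a literal `+`. Counting the matches the same way on kaputt.txt is a quick check against the expected 13 numbers.

```shell
# Made-up sample data; the real kaputt.txt is not reproduced here
sample='junk +49 30 1234567 junk
noise 0123 4 567890 noise
x 12345678-90 y'

# Three of the alternatives, one per assumed format
pattern='\+\d\d \d\d \d{7}|\d{4} \d \d{6}|\d{8}-\d\d'

# -o prints each match on its own line, so wc -l counts the matches
matches=$(printf '%s\n' "$sample" | grep -Pao "$pattern")
count=$(printf '%s\n' "$matches" | wc -l)

printf '%s\n' "$matches"
printf 'found: %d\n' "$count"
```

Because no alternative is a prefix of another here, the order inside the alternation does not change what is matched.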
If you want to match across lines then the -z flag is required and each space could be replaced with, for example, \s.
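A small sketch of the difference, on a made-up number that is broken across two lines (the input and the variable names are assumptions). With -z grep reads the input as NUL-separated records, so the newline is ordinary data that \s can match. Matches are then also terminated by NUL bytes, which `tr` turns back into newlines here:

```shell
# Hypothetical input: one number in the "xxxx x xxxxxx" format, split over two lines
input='0123 4
567890'

# Without -z each line is searched on its own, so nothing matches
without_z=$(printf '%s\n' "$input" | grep -Pao '\d{4}\s\d\s\d{6}' || true)

# With -z the newline is part of the searched text and \s matches it
with_z=$(printf '%s\n' "$input" | grep -Pzao '\d{4}\s\d\s\d{6}' | tr '\0' '\n')

printf 'without -z: [%s]\n' "$without_z"
printf 'with -z:    [%s]\n' "$with_z"
```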