I'm trying to parse, in PHP, a data file describing network nodes. The data file consists of 4 columns, with the 4th column being optional:
<Node name> <Operating System> <Description> <SSL info>
e.g.
SIG.SND.SERV ZOS A Big client (MF LPAR PROD)
ADC1 ZOS AMEX SSL
ADEPTRA.GB1.PROD LINUX BANK OF SOMEWHERE TLS
ADEPTRA.GB2.CQA LINUX BANK OF SOMEWHERE TLS
AIX.EG3C UNIX BARCLAYS S
AIX.EG3P UNIX BARCLAYS S
AIX.RMWDEV1 UNIX FDCS
AIX.RMWPROD1 UNIX FDCS
AIX-EFXWRWCK01 UNIX EQUIFAX EUROPE
ANB-DRC-CDFDI NT ARAB NATIONAL BANK (ANB) SSL
My problems lie with column3, the description, which may have spaces and column 4 with may be absent.
I've tried the following REGEXes (with the REGEX checker at https://regex101.com/ :
(\S )\s (\S )\s (\S. ?\S)\s (\S. )?
(This was my attempt to express 'non-whitespace followed by some spaces, non-whitespace followed by some spaces, characters delimited by non-whitespace followed by at least 2 spaces
which works except when the 4th column is absent. My attempts to make the last spaces and characters optional results in columns 3 and 4 matching together:
(\S )\s{2,} (\S )\s{2,} (\S. \S)(\s{2,})? ?(\S. )?
I also tried to create a REGEX that specified 'Any characters deliminated by non-whitespace followed by either at least 2 spaces or the end of the string' :
(\S. ?\S)[\s{2}|$]
which is also 'close but no cigar'.
Sadly yes - that description in the first line does have multiple spaces between the words.
Edit : I've just seen that the description field can have brackets in it, which complicated things e.g. :
ADEPTRA.GB2.CQA LINUX FICO (ADEPTRA) LTD TLS
Added a link to one of my attempts: https://regex101.com/r/k9JlZu/1
CodePudding user response:
You may use this regex to capture 3 or 4 groups in each line:
^(\S )\h (\S )\h (\w (?:\h\w )*(?:\h*\([^)] \)(?:\h\w )?)?)(?:\h{2,}(\S ))?$
RegEx Details:
^
: Start(\S )
: Match 1 non-whitespace characters in capture group #1\h
: Match 1 whitespaces(\S )
: Match 1 non-whitespace characters in capture group #2\h
: Match 1 whitespaces(\w (?:\h\w )*(?:\h*\([^)] \))?)
: Capture group #3 to match spaces separated words optionally followed by a(...)
string(?:\h (\S ))?
: An optional non-capture group in the end to capture 1 non-whitespace characters in capture group #4$
: End
CodePudding user response:
Code: (Demo)
preg_match_all(
'/^(\S )\h (\S )\h (\S (?:\h\S |\h \([^)] \))*)(?:\h (SSL|TLS|S\ ))?/m',
$txt,
$m
);
var_export($m);
It only seems that the 3rd group is the tricky one.
I recommend:
( # capture group #3
\S # one or more visible characters
(?: # a non-capturing group
\h\S # a single horizontal space followed by one or more visible characters
| # or
\h \([^)] \) # one or more horizontal spaces, opening parenthesis, one or more non-closing parentheses, closing parenthesis
)* # zero or more times
)
The fourth/optional column seems to have finite values -- perhaps just list these explicitly in your pattern.