Home > database >  REGEX to match 3 or 4 columns of data, one of which might contain spaces
REGEX to match 3 or 4 columns of data, one of which might contain spaces

Time:07-02

I'm trying to parse, in PHP, a data file describing network nodes. The data file consists of 4 columns, with the 4th column being optional:

<Node name>  <Operating System> <Description> <SSL info>

e.g.

SIG.SND.SERV                   ZOS        A Big client (MF  LPAR   PROD)
ADC1                           ZOS        AMEX                           SSL
ADEPTRA.GB1.PROD               LINUX      BANK OF SOMEWHERE            TLS
ADEPTRA.GB2.CQA                LINUX      BANK OF SOMEWHERE            TLS
AIX.EG3C                       UNIX       BARCLAYS                       S 
AIX.EG3P                       UNIX       BARCLAYS                       S 
AIX.RMWDEV1                    UNIX       FDCS
AIX.RMWPROD1                   UNIX       FDCS
AIX-EFXWRWCK01                 UNIX       EQUIFAX EUROPE
ANB-DRC-CDFDI                  NT         ARAB NATIONAL BANK (ANB)      SSL

My problems lie with column3, the description, which may have spaces and column 4 with may be absent.

I've tried the following REGEXes (with the REGEX checker at https://regex101.com/ :

(\S )\s (\S )\s (\S. ?\S)\s (\S. )?

(This was my attempt to express 'non-whitespace followed by some spaces, non-whitespace followed by some spaces, characters delimited by non-whitespace followed by at least 2 spaces

which works except when the 4th column is absent. My attempts to make the last spaces and characters optional results in columns 3 and 4 matching together:

(\S )\s{2,} (\S )\s{2,} (\S. \S)(\s{2,})? ?(\S. )?

I also tried to create a REGEX that specified 'Any characters deliminated by non-whitespace followed by either at least 2 spaces or the end of the string' :

(\S. ?\S)[\s{2}|$]

which is also 'close but no cigar'.

Sadly yes - that description in the first line does have multiple spaces between the words.

Edit : I've just seen that the description field can have brackets in it, which complicated things e.g. :

ADEPTRA.GB2.CQA                LINUX      FICO (ADEPTRA) LTD            TLS

Added a link to one of my attempts: https://regex101.com/r/k9JlZu/1

CodePudding user response:

You may use this regex to capture 3 or 4 groups in each line:

^(\S )\h (\S )\h (\w (?:\h\w )*(?:\h*\([^)] \)(?:\h\w )?)?)(?:\h{2,}(\S ))?$

RegEx Demo

RegEx Details:

  • ^: Start
  • (\S ): Match 1 non-whitespace characters in capture group #1
  • \h : Match 1 whitespaces
  • (\S ): Match 1 non-whitespace characters in capture group #2
  • \h : Match 1 whitespaces
  • (\w (?:\h\w )*(?:\h*\([^)] \))?): Capture group #3 to match spaces separated words optionally followed by a (...) string
  • (?:\h (\S ))?: An optional non-capture group in the end to capture 1 non-whitespace characters in capture group #4
  • $: End

CodePudding user response:

Code: (Demo)

preg_match_all(
    '/^(\S )\h (\S )\h (\S (?:\h\S |\h \([^)] \))*)(?:\h (SSL|TLS|S\ ))?/m',
    $txt,
    $m
);
var_export($m);

It only seems that the 3rd group is the tricky one.

I recommend:

(                 # capture group #3
  \S              # one or more visible characters
  (?:             # a non-capturing group
    \h\S          # a single horizontal space followed by one or more visible characters
    |             # or
    \h \([^)] \)  # one or more horizontal spaces, opening parenthesis, one or more non-closing parentheses, closing parenthesis
  )*              # zero or more times
)

The fourth/optional column seems to have finite values -- perhaps just list these explicitly in your pattern.

  • Related