Parsing between characters using Perl in SAS-CodePudding

I am sure this is a simple thing to do but I cannot seem to find any examples or make it past the numerous documentation sources I have been using.

I have a variable in a table (called location) such as: OH_DRT HOME_G4-T7 77 Cafe Entrance

I want to parse this into several columns based on some delimiters. There is variability in my data set, so I thought Perl regular expressions for pattern matching would be the way to go. I am trying to take that string and break it up into something like this:

State	Building	Name	Desc
OH	DRT HOME	G4	T7 Cafe Entrance
FL	Cleveland	RG	03 Back Entry

I am able to split the first part out:

Data Mydata;
     Set Int_Data;
     retain re;
     if _N_ = 1 Then re = prxparse("/(\D{2})/");

     if prxmatch(re, location) Then Do;
          State = prxposn(re, 1, location);
     end;
run;

Parsing out the other sections is where I am at a loss. The only one I have been able to get to work correctly is the State. I assume I should be able to pull anything between two characters.

In my head I should be able to split it like this: anything before the first _, anything between the first _ and the second _, anything from the second _ to the first -, and finally anything after the -

CodePudding user response：

Do all records have exactly the same format? If so:

use warnings;
use strict;

my $data = 'OH_DRT HOME_G4-T7 77 Cafe entrance';

my ($state, $building, $name, $desc);

if ($data =~ /^([A-Z]{2})_(.*)_(\w{2})-\w{2}\s+(.*)$/) {
    $state = $1;
    $building = $2;
    $name = $3;
    $desc = $4;
}

print "$state, $building, $name, $desc\n";

The regex works as follows:

Capture two uppercase letters at the start of the string and put them into $1
Skip an underscore, capture everything until the next underscore, and put it into $2
Skip that underscore, capture the following two word characters, and put them into $3
Skip a hyphen, the following two word characters, and the whitespace after them, then put the remaining portion of the string into $4
Assign the numbered captures to the more descriptive named variables

Note that if any of the matches/captures fail, all of the named variables will be undefined.

The output of the above is:

OH, DRT HOME, G4, 77 Cafe entrance

CodePudding user response：

You can use a pattern with 4 capture groups. Note that if you take the following remark from the question literally, the last group will contain T7 77 Cafe entrance:

and then finally anything after the -

If you want to match anything between the underscores and the -, you can use a negated character class, which matches any character except the ones you list.

To keep a match from crossing line boundaries, also exclude the newline and the carriage return, for example [^_\r\n]

^([^_]+)_([^_]+)_([^-]+)-(.*)

Explanation

^ Start of string
([^_]+)_ Capture 1+ chars other than _ in group 1, then match the _
([^_]+)_ Capture 1+ chars other than _ in group 2, then match the _
([^-]+)- Capture 1+ chars other than - in group 3, then match the -
(.*) Capture everything after the hyphen in group 4
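
As a rough check of the four-group pattern, here is a minimal Perl sketch applied to the sample string from the question. The helper name split_location is hypothetical and not part of either answer; it only wraps the match so the captures come back as a list.

```perl
use strict;
use warnings;

# Hypothetical helper: in list context, returns (state, building, name, desc),
# or an empty list when the string does not match.
sub split_location {
    my ($location) = @_;
    return $location =~ /^([^_]+)_([^_]+)_([^-]+)-(.*)/;
}

# Sample value taken from the question.
my @fields = split_location('OH_DRT HOME_G4-T7 77 Cafe Entrance');
print join('|', @fields), "\n";   # OH|DRT HOME|G4|T7 77 Cafe Entrance
```

The last field keeps the T7 prefix, which is what "anything after the -" asks for when read literally.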

If you want 77 Cafe entrance in group 4:

^([^_]+)_([^_]+)_([^-]+)-[^\s-]*\s*(.*)
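
Since the question mentions variability in the data, a sketch along these lines might help spot records that do not fit this second pattern. The helper name parse_location and the second input string are assumptions made up for illustration; only the first string comes from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref of the four fields, or undef on no match.
sub parse_location {
    my ($location) = @_;
    my ($state, $building, $name, $desc) =
        $location =~ /^([^_]+)_([^_]+)_([^-]+)-[^\s-]*\s*(.*)/
        or return;
    return { state => $state, building => $building, name => $name, desc => $desc };
}

# The second string is an invented malformed record, used to show the fallback branch.
for my $location ('OH_DRT HOME_G4-T7 77 Cafe Entrance', 'no delimiters here') {
    my $rec = parse_location($location);
    if ($rec) {
        print "$rec->{state}, $rec->{building}, $rec->{name}, $rec->{desc}\n";
    } else {
        print "could not parse: $location\n";
    }
}
```

Returning undef for a failed match keeps the named fields from being used while undefined, which is the caveat raised in the first answer.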
