Parsing between characters using Perl in SAS

Time:08-24

I am sure this is a simple thing to do but I cannot seem to find any examples or make it past the numerous documentation sources I have been using.

I have a variable in a table (called location) such as: OH_DRT HOME_G4-T7 77 Cafe Entrance

I want to parse this into several columns based on some delimiters. There is variability in my data set, so I thought using Perl regular expressions for pattern matching would be the way to go. I am trying to take that string and break it up into something like this:

State  Building   Name   Desc
OH     DRT HOME   G4 T7  Cafe Entrance
FL     Cleveland  RG 03  Back Entry

I am able to split the first part out

Data Mydata;
     Set Int_Data;
     retain re;
     /* Compile the pattern once, on the first observation */
     if _N_ = 1 Then re = prxparse("/(\D{2})/");

     if prxmatch(re, location) Then Do;
          State = prxposn(re, 1, location);
     end;
run;

It is parsing out the other sections that I am at a loss for. The only one I have been able to get to work correctly is the State. I assume I should be able to pull out anything between two characters.

In my head I should be able to split it like this: anything before the first _, anything between the first _ and the second _, anything from the second _ to the first -, and finally anything after the -

CodePudding user response:

Are all records exactly the same? If so:

use warnings;
use strict;

my $data = 'OH_DRT HOME_G4-T7 77 Cafe entrance';

my ($state, $building, $name, $desc);

if ($data =~ /^([A-Z]{2})_(.*)_(\w{2})-\w{2}\s+(.*)$/) {
    $state = $1;
    $building = $2;
    $name = $3;
    $desc = $4;
}

print "$state, $building, $name, $desc\n";

The regex works as follows:

  • Capture two uppercase letters at the start of the string and put them into $1
  • Skip an underscore, capture everything up to the last underscore, and put it into $2
  • Capture the following two word characters and put them into $3
  • Skip a hyphen and the following two word characters along with any amount of whitespace, and put the remaining portion of the string into $4
  • Assign the numbered matches to the more descriptive named variables

Note that if any of the matches/captures fail, all of the named variables will be undefined.
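Because a failed match leaves the variables undefined, the final print would emit "uninitialized value" warnings for such records. A minimal sketch of one way to guard against that is below; parse_location is a hypothetical helper name, and the second sample string is made up purely to show the non-matching branch:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the four captures on success,
# or an empty list when the pattern does not match.
sub parse_location {
    my ($data) = @_;
    # In list context a successful match returns ($1, $2, $3, $4);
    # a failed match returns the empty list.
    return $data =~ /^([A-Z]{2})_(.*)_(\w{2})-\w{2}\s+(.*)$/;
}

for my $data ('OH_DRT HOME_G4-T7 77 Cafe entrance', 'no delimiters here') {
    my @fields = parse_location($data);
    if (@fields) {
        print join(', ', @fields), "\n";
    }
    else {
        print "No match: $data\n";
    }
}
```

Checking the returned list before using it keeps records in an unexpected format from being silently printed as empty fields.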

The output of the above is:

OH, DRT HOME, G4, 77 Cafe entrance

CodePudding user response:

You can use a pattern with 4 capture groups, but note that if you take the following remark from the question literally, the last group will contain T7 77 Cafe entrance:

"and then finally anything after the -"

If you want to match anything between the underscores and the -, you can use a negated character class, which matches any character except the ones you list.

To avoid crossing newlines, you can add a newline and a carriage return to the class: [^_\r\n]

^([^_]+)_([^_]+)_([^-]+)-(.*)

Explanation

  • ^ Start of string
  • ([^_]+)_ Capture 1+ chars other than _ in group 1, then match the _
  • ([^_]+)_ Capture 1+ chars other than _ in group 2, then match the _
  • ([^-]+)- Capture 1+ chars other than - in group 3, then match the -
  • (.*) Capture the rest of the line, after the hyphen, in group 4
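As a quick check, the pattern can be applied to the sample value from the question. This is only a sketch in plain Perl; split_location is a hypothetical helper name, not something from the question or from SAS:

```perl
use strict;
use warnings;

# Hypothetical helper: splits a location string on the first two
# underscores and the first hyphen that follows them.
sub split_location {
    my ($location) = @_;
    # Each negated class stops at its delimiter, so a group
    # cannot run past the next _ or -.
    return $location =~ /^([^_]+)_([^_]+)_([^-]+)-(.*)/;
}

my @groups = split_location('OH_DRT HOME_G4-T7 77 Cafe Entrance');
print join(' | ', @groups), "\n";
# prints: OH | DRT HOME | G4 | T7 77 Cafe Entrance
```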

If you want 77 Cafe entrance in group 4:

^([^_]+)_([^_]+)_([^-]+)-[^\s-]*\s*(.*)
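
If it helps readability, the same pattern can be written with named capture groups so each piece maps directly to a column name. The sketch below assumes Perl 5.10 or later (for named captures); parse_named is a hypothetical helper, and the group names simply reuse the column names from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash reference keyed by column name,
# or undef when the string does not match.
sub parse_named {
    my ($location) = @_;
    return unless $location =~
        /^(?<state>[^_]+)_(?<building>[^_]+)_(?<name>[^-]+)-[^\s-]*\s*(?<desc>.*)/;
    # Copy %+ so the captures survive any later regex match.
    my %fields = %+;
    return \%fields;
}

my $rec = parse_named('OH_DRT HOME_G4-T7 77 Cafe Entrance');
print "$rec->{state}, $rec->{building}, $rec->{name}, $rec->{desc}\n" if $rec;
# prints: OH, DRT HOME, G4, 77 Cafe Entrance
```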
