I am sure this is a simple thing to do but I cannot seem to find any examples or make it past the numerous documentation sources I have been using.
I have a variable in a table (called location) such as: OH_DRT HOME_G4-T7 77 Cafe Entrance
I want to be able to parse this into several columns based on some delimiters. There is variability in my data set, so I thought using Perl regular expressions for pattern matching would be the way to go. I am trying to take that string and break it up into something like this:
State | Building | Name | Desc |
---|---|---|---|
OH | DRT HOME | G4 | T7 Cafe Entrance |
FL | Cleveland | RG | 03 Back Entry |
I am able to split the first part out
Data Mydata;
Set Int_Data;
retain re;
if _N_ = 1 Then re = prxparse("/(\D{2})/");
if prxmatch(re, location) Then Do;
State= prxposn(re,1,location);
end;
run;
It is parsing out the other sections that I am at a loss for. The only one I have been able to get to work correctly is State. I assume I should be able to pull out anything between two characters.
In my head I should be able to split something like this: anything before the first _, anything between the first _ and the second _, anything from the second _ to the first -, and then finally anything after the -
CodePudding user response:
Do all records have exactly the same format? If so:
use warnings;
use strict;
my $data = 'OH_DRT HOME_G4-T7 77 Cafe entrance';
my ($state, $building, $name, $desc);
if ($data =~ /^([A-Z]{2})_(.*)_(\w{2})-\w{2}\s+(.*)$/) {
$state = $1;
$building = $2;
$name = $3;
$desc = $4;
}
print "$state, $building, $name, $desc\n";
The regex works as follows:
- Capture two upper-case letters at the start of the string and put them into $1
- Skip an underscore, capture everything until the next underscore, and put it into $2
- Capture the following two word characters and put them into $3
- Skip a hyphen and the following two word characters along with any amount of whitespace, and put the remaining portion of the string into $4
- Assign the numbered matches into the more descriptive named variables
Note that if any of the matches/captures fail, all of the named variables will be undefined.
The output of the above is:
OH, DRT HOME, G4, 77 Cafe entrance
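Because the captures are only set when the whole pattern matches, it can help to handle non-matching records explicitly instead of printing undefined variables. The following is a minimal sketch of that idea, not a definitive implementation: the helper name parse_location and the second sample record are assumptions made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): returns the four
# captured fields as a list, or an empty list when the record does not match.
sub parse_location {
    my ($location) = @_;
    if ($location =~ /^([A-Z]{2})_(.*)_(\w{2})-\w{2}\s+(.*)$/) {
        return ($1, $2, $3, $4);
    }
    return;    # empty list in list context
}

# The second record is an invented example of a line that does not match.
for my $record ('OH_DRT HOME_G4-T7 77 Cafe entrance', 'not a location') {
    my @fields = parse_location($record);
    if (@fields) {
        print join(', ', @fields), "\n";
    }
    else {
        print "no match: $record\n";
    }
}
```

With this shape, a record that does not fit the pattern is reported rather than triggering "uninitialized value" warnings in the print statement.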
CodePudding user response:
You can use a pattern with 4 capture groups, but note that if you follow this remark from the question literally, the last group will contain T7 77 Cafe entrance:
"and then finally anything after the -"
If you want to match anything between the underscores and the -, you can use a negated character class, which matches any character except the ones you list.
To avoid crossing newlines, you can also exclude the newline and the carriage return, for example [^_\r\n]
^([^_]+)_([^_]+)_([^-]+)-(.*)
Explanation
- ^ Start of string
- ([^_]+)_ Capture 1+ chars other than _ in group 1 and then match the _
- ([^_]+)_ Capture 1+ chars other than _ in group 2 and then match the _
- ([^-]+)- Capture 1+ chars other than - in group 3 and then match the -
- (.*) Match everything after the hyphen in group 4
If you want 77 Cafe entrance in group 4:
^([^_]+)_([^_]+)_([^-]+)-[^\s-]*\s*(.*)
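To compare the two negated-character-class patterns side by side, here is a short Perl sketch that runs both against the sample string from the question. The helper name split_location and the variable names are assumptions introduced for illustration only.

```perl
use strict;
use warnings;

my $location = 'OH_DRT HOME_G4-T7 77 Cafe entrance';

# Group 4 takes everything after the hyphen.
my $after_hyphen = qr/^([^_]+)_([^_]+)_([^-]+)-(.*)/;

# Skips the first token after the hyphen and any whitespace that follows,
# so group 4 starts at the next token.
my $skip_token = qr/^([^_]+)_([^_]+)_([^-]+)-[^\s-]*\s*(.*)/;

# Hypothetical helper: a match in list context returns the capture groups,
# or an empty list when the pattern does not match.
sub split_location {
    my ($re, $text) = @_;
    my @parts = $text =~ $re;
    return @parts;
}

print join(' | ', split_location($after_hyphen, $location)), "\n";
print join(' | ', split_location($skip_token,   $location)), "\n";
```

For the sample string, the first pattern leaves T7 77 Cafe entrance in the last group, while the second leaves 77 Cafe entrance.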