I have a CSV file with one of the fields holding state/country info, formatted like: "Florida United States" or "Alberta Canada" or "Wellington New Zealand" - not comma or tab delimited between them, simply space delimited.
I have an array of all the potential countries as well.
What I am looking for is a solution that, in a loop, splits the state and country into separate variables by matching the country against the $countryarray that I have. Something like:
$countryarray=array("United States","Canada","New Zealand");
$userfield="Wellington New Zealand";
// Pseudocode: $somefunction = (match "New Zealand", extract it into $country, the rest into $state)
Split won't do it straight up - because many of the countries AND states have spaces, but the original data set concatenated the state and country together with just a space...
TIA!
CodePudding user response:
I'm a fan of the RegEx method that @Mike Morton mentioned. You can take an array of countries, implode them using the | (which is a RegEx OR), and use that as an "ends with one of these" pattern.
Below I've come up with two ways to do this, a simple way and an arguably overly complicated way that does some extra escaping. To illustrate what that escaping does I've added a fake country called Country XYZ (formally ABC).
Here's the sample data that works with both methods, as well as a helper function that actually does the matching and echoing. The RegEx does named-capturing, too, which makes things really easy to deal with.
// Sample data
$data = [
    'Wellington New Zealand',
    'Florida United States of America',
    'Quebec Canada',
    'Something Country XYZ (formally ABC)',
];
// Array of all possible countries
$countries = [
    'United States of America',
    'Canada',
    'New Zealand',
    'Country XYZ (formally ABC)',
];
// The beginning and ending pattern delimiter for the RegEx
$delim = '/';
function matchAndShowData(array $data, array $countries, string $delim, string $countryParts): void
{
    $pattern = "^(?<region>.*?) (?<country>$countryParts)$";
    foreach($data as $d) {
        if(preg_match($delim . $pattern . $delim, $d, $matches)){
            echo sprintf('%1$s, %2$s', $matches['region'], $matches['country']), PHP_EOL;
        } else {
            echo 'NO MATCH: ' . $d, PHP_EOL;
        }
    }
}
Option 1
The first option is a naïve implode. This method, however, will not find the country that includes parentheses, because the unescaped parentheses are treated as a RegEx group rather than as literal characters.
matchAndShowData($data, $countries, $delim, implode('|', $countries));
Output
Wellington, New Zealand
Florida, United States of America
Quebec, Canada
NO MATCH: Something Country XYZ (formally ABC)
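To make the failure above concrete, here is a minimal sketch that compares an unescaped and an escaped pattern for the fake country alone. The variable names $naive, $quoted, and $input are only for illustration and are not part of the code above.

```php
<?php
// Illustrative values; only the fake country from the sample data is used.
$country = 'Country XYZ (formally ABC)';
$input   = 'Something Country XYZ (formally ABC)';

// Unescaped: "(formally ABC)" is parsed as a capturing group, so the pattern
// expects the text "Country XYZ formally ABC" without literal parentheses.
$naive = '/^(?<region>.*?) (?<country>' . $country . ')$/';

// Escaped: preg_quote() turns "(" and ")" into "\(" and "\)".
$quoted = '/^(?<region>.*?) (?<country>' . preg_quote($country, '/') . ')$/';

var_dump(preg_match($naive, $input));   // int(0)
var_dump(preg_match($quoted, $input));  // int(1)

// The naive pattern does match once the parentheses are removed from the input.
var_dump(preg_match($naive, 'Something Country XYZ formally ABC')); // int(1)
```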
Option 2
The second option applies proper RegEx quoting of the countries, just in case they have special characters. If you are 100% certain you don't have any, this is overkill, but I personally have learned, after way too many hours of debugging, to just always quote, just in case.
$patternParts = array_map(fn(string $country) => preg_quote($country, $delim), $countries);
// Implode the cleaned countries using the RegEx pipe operator which means "OR"
matchAndShowData($data, $countries, $delim, implode('|', $patternParts));
Output
Wellington, New Zealand
Florida, United States of America
Quebec, Canada
Something, Country XYZ (formally ABC)
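If you want the two parts in separate variables, as the question asks, rather than echoed, the same pattern can be wrapped in a small function. This is only a sketch: the name splitRegionCountry and the choice to return null when nothing matches are my assumptions, not something the question requires.

```php
<?php
// Hypothetical helper: returns [region, country], or null when the field
// does not end with one of the known countries.
function splitRegionCountry(string $field, array $countries): ?array
{
    // Quote every country so special characters are matched literally.
    $alternation = implode('|', array_map(
        fn(string $c) => preg_quote($c, '/'),
        $countries
    ));

    if (preg_match('/^(?<region>.*?) (?<country>' . $alternation . ')$/', $field, $m)) {
        return [$m['region'], $m['country']];
    }

    return null;
}

// The question's own sample values.
$countryarray = ['United States', 'Canada', 'New Zealand'];
[$state, $country] = splitRegionCountry('Wellington New Zealand', $countryarray);

echo $state, ' / ', $country, PHP_EOL; // Wellington / New Zealand
```

Checking for null before destructuring is advisable in a real loop, since unmatched rows would otherwise leave both variables null.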
Note
If you don't expect your list of countries to change often, you can echo the pattern out and then just bake that into your code, which will probably shave a couple of milliseconds off execution, which in a tight loop might be worth it.
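A rough sketch of that idea follows, assuming the same four sample countries. The constant name COUNTRY_PATTERN is hypothetical, and its value is simply what the echo prints, wrapped in the full pattern.

```php
<?php
// Assumed country list, copied from the sample data.
$countries = [
    'United States of America',
    'Canada',
    'New Zealand',
    'Country XYZ (formally ABC)',
];
$delim = '/';

// One-off step: build the quoted alternation and print it for copy/paste.
$countryParts = implode('|', array_map(
    fn(string $c) => preg_quote($c, $delim),
    $countries
));
echo $countryParts, PHP_EOL;

// Baked-in version (hypothetical constant) that skips the rebuild at runtime.
const COUNTRY_PATTERN = '/^(?<region>.*?) (?<country>United States of America|Canada|New Zealand|Country XYZ \(formally ABC\))$/';

preg_match(COUNTRY_PATTERN, 'Quebec Canada', $matches);
echo $matches['region'], ', ', $matches['country'], PHP_EOL; // Quebec, Canada
```

The trade-off is that the constant has to be regenerated by hand whenever the country list changes.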
Demo
You can see a demo of this here: https://3v4l.org/CaNRZ
CodePudding user response:
- Prepare the array of countries for use in a regular expression with preg_quote().
- Build a regex pattern that will match a space followed by one of the country values then the end of the string. A lookahead ((?= ... )) is used to ensure that those matched characters are not consumed/destroyed while exploding.
- Save the 2-element returned array from preg_split() to the output array.
Code:
$branches = array_map(fn($country) => preg_quote($country, '/'), $countries);
$result = [];
foreach ($data as $string) {
    $result[] = preg_split('/ (?=(?:' . implode('|', $branches) . ')$)/', $string);
}
var_export($result);
Output:
array (
  0 =>
  array (
    0 => 'Wellington',
    1 => 'New Zealand',
  ),
  1 =>
  array (
    0 => 'Florida',
    1 => 'United States of America',
  ),
  2 =>
  array (
    0 => 'Quebec',
    1 => 'Canada',
  ),
  3 =>
  array (
    0 => 'Something',
    1 => 'Country XYZ (formally ABC)',
  ),
)
Note that if an item/row in the result array only has one element, then you know that the attempted split failed to match the country substring.
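One way to act on that check is sketched below. The row "Atlantis Nowhere" is made up to force a failed split, and the $split and $unmatched arrays are names chosen only for this example.

```php
<?php
// Assumed inputs; the second row deliberately matches no country.
$countries = ['United States of America', 'Canada', 'New Zealand'];
$data = ['Wellington New Zealand', 'Atlantis Nowhere'];

$branches = array_map(fn($country) => preg_quote($country, '/'), $countries);
$pattern = '/ (?=(?:' . implode('|', $branches) . ')$)/';

$split = [];
$unmatched = [];
foreach ($data as $string) {
    $parts = preg_split($pattern, $string);
    if (count($parts) === 2) {
        // Successful split: first element is the state, second the country.
        [$state, $country] = $parts;
        $split[] = ['state' => $state, 'country' => $country];
    } else {
        // No split happened, so the country was not recognised.
        $unmatched[] = $string;
    }
}

var_export($split);
var_export($unmatched);
```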
I use this same technique when splitting a street name from a street type, for cases like "First Street North" where the street type is multi-word.
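As a rough illustration of the street case, the same lookahead split can be pointed at a list of street types. The $streetTypes and $addresses values here are invented for the example and are not a real reference list.

```php
<?php
// Hypothetical street types, including one multi-word type.
$streetTypes = ['Street North', 'Street', 'Avenue'];
$addresses = ['First Street North', 'Old Mill Avenue'];

$branches = array_map(fn($type) => preg_quote($type, '/'), $streetTypes);
$pattern = '/ (?=(?:' . implode('|', $branches) . ')$)/';

$result = [];
foreach ($addresses as $address) {
    // Split on the space that is followed by a street type and end of string.
    $result[] = preg_split($pattern, $address);
}

var_export($result);
```

Because the lookahead is anchored to the end of the string, "Street" on its own cannot match inside "First Street North", so the multi-word type is kept whole.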