I've got a CSV parser and the file has multiple strings which I need to map to different fields within one excel cell.
So the example image above I have the postcode, state, and suburb which are all in the same cell - but I need to map these to their respective tables in the database.
How would I do this? The team is saying likely regex, if so what would be the regex I use?
Cheers guys!
I expect each type of data to be split into the relevant data table
CodePudding user response:
Regex they say? Then you will have two problems.
Tackle this with extracting the postcode, then a string split. Here in psuedo code:
For i = 1 to lines.Length
string s = lines[i];
string postCode = s[s.Length -4]; // Get postcode
s = s[0, s.Length -4].TrimEnd(); //Remove postcode and trim last space
arr[] = s.Split(" "); // Split words into array
string state= arr[arr.Length - 1]; //The last word is the state
string suburb;
For j = 0 to arr.Length - 1
suburb = arr[j];
Next j
Next i
CodePudding user response:
Not sure what language/framework you're using, but I'm just posting this in vanilla javascript.
The regex I'd use for each cell is:
/([\w\s] )\s([A-Z]{2,3})\s(\d{4})/
Which breaks down into:
([\w\s] )
capture multiple words or spaces\s
followed by a space([A-Z]{2,3})
capture a group of two or three capital letters\s
followed by a space(\d{4})
capture four digits
I always find https://www.regexpal.com really useful to write and test my regexes.
Using that in javascript:
const addresses = [
'Elizabeth Bay NSW 2011',
'Sydney NSW 2000',
'St Kilda VIC 3182',
'Sausage 1234567890',
'Middle Park QLD 4047',
'The Big Pineapple QLD 4559',
'Uluru NT 0872'
]
const reg = /([\w\s] )\s([A-Z]{2,3})\s(\d{4})/
const matches = addresses.map(address => {
const match = address.match(reg)
return match && match.slice(1)
})
console.log(matches)
Will give you:
// Matching string, Suburb, State, Postcode
["Elizabeth Bay NSW 2011", "Elizabeth Bay", "NSW", "2011"]
["Sydney NSW 2000", "Sydney", "NSW", "2000"]
["St Kilda VIC 3182", "St Kilda", "VIC", "3182"]
null // The non-matching entry
["Middle Park QLD 4047", "Middle Park", "QLD", "4047"]
["The Big Pineapple QLD 4559", "The Big Pineapple", "QLD", "4559"]
["Uluru NT 0872", "Uluru", "NT", "0872"]