I'm trying to parse a string splitting on commas except when comma is inside quotes:
$teststring = 'Data1,Data2,"Data 3, more", Data 4'
$tempsting = $null
$tempsting = $teststring -split ',(?=([^\"]*\"[^\"]*\")*[^\"]*$)'
Write-Host $teststring
$i =0
Foreach ($object in $tempsting) {
Write-Host $i " " $tempsting[$i]
#Write-Host $tempsting[$i]
$i
}
Outputs this:
Data1,Data2,"Data 3, more", Data 4
0 Data1
1 Data2,"Data 3, more"
2 Data2
3 "Data 3, more"
4 "Data 3, more"
5 Data 4
What I want is this:
Data1,Data2,"Data 3, more", Data 4
0 Data1
1 Data2
2 "Data 3, more"
3 Data 4
CodePudding user response:
I believe you're looking for a solution that involves manually parsing your string with regex but just in case going to post this as a PowerShell alternative that is more object oriented and uses ConvertFrom-Csv
to parse your string. By no means this is an efficient solution but seems to do what you're looking for.
$string = 'Data1,Data2,"Data 3, more", Data 4'
$parsed = $string, $string | ConvertFrom-Csv
$parsed.PSObject.Properties.Value | & {
begin { $i = 1 }
process {
$delim = ''
if($_.Contains(',')) { $delim = '"' }
($i ).ToString().PadRight(5) [string]::Format("{0}$_{0}", $delim)
}
}
Outputs:
1 Data1
2 Data2
3 "Data 3, more"
4 Data 4
CodePudding user response:
To complement Santiago Squarzon's helpful answer with a regex-based solution:
# Split the string into fields.
$fields = (
'Data1,Data2,"Data 3, more", Data 4' -split '([^",] |"[^"]*")' -notmatch '^,?$'
).Trim()
# Output the resulting fields, preceded by their index.
$i = 0; $fields.ForEach({ [pscustomobject] @{ Ndx=($i ); Value=$_ } })
Output:
Ndx Value
--- -----
0 Data1
1 Data2
2 "Data 3, more"
3 Data 4
Explanation:
The regex passed to the
-split
operator is somewhat repurposed in that it normally is designed to match what separates the fields, but here matches the fields, and by virtue of using a capture group ((...)
), includes the "separator" matches in the output.- For an explanation of the regex and the ability to experiment with it, see this regex101.com page.
Thus, the resulting tokens comprise both the "fields" - which in this case are the actual separators - and the "separators" - which in this case are the actual fields.
-notmatch '^,?$'
is then used to filter out the actual separators to leave just the array of actual fields (that is, array elements that are either the empty string or contain just,
are eliminated).
As an aside: The -split
-notmatch
approach above is really a [regex]::Matches()
solution in disguise (see zett42's helpful answer). Use of [regex]::Matches()
is a more direct expression of the solution's intent, but comes with its own complexity, not least due to not using PowerShell-native features. Bringing the functionality of [regex]::Matches()
to PowerShell in the future in the form of a -matchall
operator has been green-lighted, but is still awaiting implementation as of this writing: see GitHub issue #7867
CodePudding user response:
mklement0 was faster with the RegEx-based solution. To complement, I'm adding a different variant. Instead of splitting the string, I search for each comma-separated field, which can be either quoted or unquoted.
$string = 'Data1,Data2,"Data 3, more", Data 4'
$i = 0
[regex]::Matches( $string, '("[^"] "|[^,] )(?:\s*,\s*|$)' ).ForEach{
($i ).ToString().PadRight(5) $_.Groups[1].Value
}
Output:
0 Data1
1 Data2
2 "Data 3, more"
3 Data 4
- The RegEx patterns consists of two variants
(pattern1|pattern2)
. First one is quoted string and if it doesn't match, unquoted string will be matched. - Demo and full explanation at regex101. (Note that I had to escape
"
for regex101, although it isn't needed for PowerShell. This is because regex101 assumes use from C#. Please upvote this issue if you'd like a true PowerShell flavor for regex101.) - This isn't a full-fledged CSV parser. If you need to handle edge cases, such as embedded quotation marks, refer to this answer, which presents a more robust solution.