Home > Mobile >  PowerShell: Splitting by comma unless in quotes
PowerShell: Splitting by comma unless in quotes

Time:07-07

I'm trying to parse a string splitting on commas except when comma is inside quotes:

$teststring = 'Data1,Data2,"Data 3, more", Data 4'
$tempsting = $null
$tempsting = $teststring -split ',(?=([^\"]*\"[^\"]*\")*[^\"]*$)'

Write-Host $teststring
$i =0 

Foreach ($object in $tempsting) {
    Write-Host $i " " $tempsting[$i]
    #Write-Host $tempsting[$i]
    $i  
}

Outputs this:

Data1,Data2,"Data 3, more", Data 4
0   Data1
1   Data2,"Data 3, more"
2   Data2
3   "Data 3, more"
4   "Data 3, more"
5    Data 4

What I want is this:

Data1,Data2,"Data 3, more", Data 4
0   Data1
1   Data2
2   "Data 3, more"
3   Data 4

CodePudding user response:

I believe you're looking for a solution that involves manually parsing your string with but just in case going to post this as a PowerShell alternative that is more object oriented and uses ConvertFrom-Csv to parse your string. By no means this is an efficient solution but seems to do what you're looking for.

$string = 'Data1,Data2,"Data 3, more", Data 4'
$parsed = $string, $string | ConvertFrom-Csv
$parsed.PSObject.Properties.Value | & {
    begin   { $i = 1 }
    process {
        $delim = ''
        if($_.Contains(',')) { $delim = '"' }
        ($i  ).ToString().PadRight(5)   [string]::Format("{0}$_{0}", $delim)
    }
}

Outputs:

1    Data1
2    Data2
3    "Data 3, more"
4    Data 4

CodePudding user response:

To complement Santiago Squarzon's helpful answer with a regex-based solution:

# Split the string into fields.
$fields = (
  'Data1,Data2,"Data 3, more", Data 4' -split '([^",] |"[^"]*")' -notmatch '^,?$'
).Trim()

# Output the resulting fields, preceded by their index.
$i = 0; $fields.ForEach({ [pscustomobject] @{ Ndx=($i  ); Value=$_ } })

Output:

Ndx Value
--- -----
  0 Data1
  1 Data2
  2 "Data 3, more"
  3 Data 4

Explanation:

  • The regex passed to the -split operator is somewhat repurposed in that it normally is designed to match what separates the fields, but here matches the fields, and by virtue of using a capture group ((...)), includes the "separator" matches in the output.

  • Thus, the resulting tokens comprise both the "fields" - which in this case are the actual separators - and the "separators" - which in this case are the actual fields.
    -notmatch '^,?$' is then used to filter out the actual separators to leave just the array of actual fields (that is, array elements that are either the empty string or contain just , are eliminated).

As an aside: The -split -notmatch approach above is really a [regex]::Matches() solution in disguise (see zett42's helpful answer). Use of [regex]::Matches() is a more direct expression of the solution's intent, but comes with its own complexity, not least due to not using PowerShell-native features. Bringing the functionality of [regex]::Matches() to PowerShell in the future in the form of a -matchall operator has been green-lighted, but is still awaiting implementation as of this writing: see GitHub issue #7867

CodePudding user response:

mklement0 was faster with the RegEx-based solution. To complement, I'm adding a different variant. Instead of splitting the string, I search for each comma-separated field, which can be either quoted or unquoted.

$string = 'Data1,Data2,"Data 3, more", Data 4'

$i = 0 
[regex]::Matches( $string, '("[^"] "|[^,] )(?:\s*,\s*|$)' ).ForEach{
    ($i  ).ToString().PadRight(5)   $_.Groups[1].Value
}

Output:

0    Data1
1    Data2
2    "Data 3, more"
3    Data 4
  • The RegEx patterns consists of two variants (pattern1|pattern2). First one is quoted string and if it doesn't match, unquoted string will be matched.
  • Demo and full explanation at regex101. (Note that I had to escape " for regex101, although it isn't needed for PowerShell. This is because regex101 assumes use from C#. Please upvote this issue if you'd like a true PowerShell flavor for regex101.)
  • This isn't a full-fledged CSV parser. If you need to handle edge cases, such as embedded quotation marks, refer to this answer, which presents a more robust solution.
  • Related