How to split large json file into multiple using PowerShell?-CodePudding

I have big JSON file having 4400000 records (200 MB) in NDJSON format.

Sample Data:

{"index":{}}
{"ip-address":"1.5.0.1","is-vpn":"true","@timestamp":"2022-12-01T18:59:48.8325021 05:30"}
{"index":{}}
{"ip-address":"243.11.0.1","is-vpn":"true","@timestamp":"2022-12-01T18:59:48.8853225 05:30"}
{"index":{}}
{"ip-address":"253.11.0.1","is-vpn":"true","@timestamp":"2022-12-01T18:59:48.8853225 05:30"}
{"index":{}}
{"ip-address":"39.24.0.1","is-vpn":"true","@timestamp":"2022-12-01T18:59:48.8853225 05:30"}
{"index":{}}
{"ip-address":"163.24.0.1","is-vpn":"true","@timestamp":"2022-12-01T18:59:48.8853225 05:30"}

I want to split this into 4 JSON files equally based on the number of rows using powershell.

Output: There will be 4 JSON file containing 1100000 rows each.

Could you please help how to do it using PowerShell.

CodePudding user response：

Since NDJSON is a line-oriented format, plain-text processing will do, so you simply need to create each file with the specified number of lines.

While a cmdlet-based pipeline solution is possible, for performance reasons reading large files in PowerShell is best handled with the switch statement, whereas direct use of .NET APIs, namely System.IO.StreamWriter, is required for efficient writing of files.

Note:

This will create output files Out_01.json through Out_04.json in the current directory - adjust the name template as needed.
- Note that if 4 chunks of 1100000 lines each do not cover all lines of your input file, you'll get additional output files (or fewer, if there are fewer input lines).
Character encoding notes:
- The output files will be BOM-less UTF-8 files by default, though you may specify an encoding explicitly, as noted in the source-code comments below.
- With respect to the input files:
  - switch -File assumes UTF-8 in the absence of a BOM in a given input file, in both PowerShell editions (which is surprising, because only PowerShell (Core) 7 (consistently) defaults to UTF-8, whereas Windows PowerShell's default is usually ANSI), and doesn't allow you specify an input encoding explicitly.
  - If you need to control the encoding for reading the files explicitly, use the System.IO.File.ReadLines .NET API in a foreach loop instead, which allows you to; e.g., to use ANSI encoding:
```
foreach (
  $line in [IO.File]::ReadLines(
    (Convert-Path -LiteralPath $inputFile), 
    [Text.Encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.AnsiCodePage)
  )
) {
  # ... work with $line instead of $_
}
```

# This makes the Write-Verbose call below show output.
$VerbosePreference = 'Continue'

# Your input file path.
$inputFile = 'input.json'

# Construct a file-path template for the sequentially numbered 
# output files; e.g., "Out_01.json"
# Note: Be sure to use a *full* path, as that is required for .NET calls.
$outFilePathTemplate = Join-Path $PWD.ProviderPath 'Out_{0:00}.json'

# Set how many lines make up a chunk.
$chunkLineCount = 1100000

# Read the file lazily and save every chunk of $chunkLineCount
# lines to a new file.
$i = 0; $chunkNdx = 0
switch -File $inputFile {
  default {
    if (  $i % $chunkLineCount -eq 1) {
      # Create new chunk file.
      # Close previous file, if any.
      if (  $chunkNdx -gt 1) { $fileWriter.Dispose() }
  
      # Construct the file path for the next chunk, by
      # instantiating the path template with the next sequence number.
      $outFilePath = $outFilePathTemplate -f $chunkNdx
      Write-Verbose "Creating chunk file: $outFilePath"
  
      # Create the next chunk file
      # Note: Default encoding is BOM-less UTF-8
      #       Pass a [System.Text.Encoding] instance to change it.
      $fileWriter = [IO.StreamWriter]::new($outFilePath)
    }
    # Write the line at hand to the current chunk file.
    $fileWriter.WriteLine($_)  
  }
}
$fileWriter.Dispose() # Close the last file.

Alternatively, if you have the Unix split utility at your disposal - by default on Unix-like platforms, installable on Windows or usable via WSL - consider its use instead: not only is such a solution more concise, but it will perform much better:

# Creates output files 'Out_00.json' through 'Out_03.json'

# Requires the GNU implementation of external utility `split`,
# as present on *Linux* systems and in *WSL*
# Numbering starts with 0 - use, e.g., --numeric-suffixes=1 to start with 1.
split --number=4 -d --additional-suffix=.json input.json Out_

# *macOS* equivalent: no support for long option names, and no
# support for --additional-suffix or --numeric-suffixes, so the 
# output files will have *no filename extension* and 
# numbering invariably starts from 0.
split -n 4 -d input.json Out_
# Add extension ".json" to the output files' names.
Get-Item Out_[0-9][0-9] | Rename-Item -NewName { $_.Name   '.json' }

Note that --number=4 specifies the number of output files to create, with split automatically determining how many lines to put in each for even distribution.

If you do want to specify the number of lines per output file, use -l 110000 (--lines=110000) instead.