I'm having a weird issue using PowerShell to merge multiple CSV files into one. I've done this plenty of times at the cmd prompt in Windows 7, but here the output only contains the earliest file. The command is standard stuff:
C:\> copy *.csv output.csv
All I am getting is, as I say, the earliest CSV copied into this new file but nothing else. Is this an issue with PowerShell vs. the simple cmd prompt?
Thanks Michael
CodePudding user response:
As noted by lit in the comments, in PowerShell copy is a built-in alias of the Copy-Item cmdlet, which functions differently from cmd.exe's internal copy command:
As of PowerShell 7.2.1, Copy-Item does not support merging multiple files into a single destination file. See the bottom section for a (potentially content-modifying) Get-Content solution.
Currently, if Copy-Item's -Destination argument (the second positional argument, output.csv in your case) is a file, all -Path arguments (the first positional argument, *.csv in your case) are sequentially copied to the same destination file. In other words: the last file that matches wildcard pattern *.csv "wins", and output.csv is simply a copy of it alone. See GitHub issue #12805 for a discussion.
To use cmd.exe's copy command, which merges the input files to form the destination file, call via cmd /c:
cmd /c 'copy /y /b *.csv output.csv'
Caveat: As discussed in aschipfl's helpful answer, how a preexisting output.csv file is handled depends on whether output.csv happens to be the first file matched by wildcard pattern *.csv or not. Either use the workaround proposed there, or simply ensure that no output.csv file is present beforehand.
Note the addition of:
/y, which suppresses a confirmation prompt if the destination file already exists
/b, which copies in binary mode, which prevents an "EOF character" (the Substitute character, 0x1a, which you can interactively produce with Ctrl-Z) from being appended to the destination file.
As an aside: on Unix-like platforms you could use sh -c 'cat *.csv > output.csv', but there you'd always have to first ensure that there's no preexisting output.csv file, as that would result in an endlessly growing file.
Alternatively, you may use the Get-Content cmdlet to merge multiple text files, as proposed by lit and refined by zett42 in the comments on the question, but doing so can change the character encoding and newline format, which may or may not be desired in a given use case:
# !! Caveat: may change character encoding and newline format.
# !! -Encoding utf8 used as an example.
Get-Content *.csv -Exclude output.csv | Set-Content -Encoding utf8 output.csv
Get-Content, assuming it interprets a text file's encoding correctly (based on a file's BOM, if present, and assuming a default otherwise), loads a file's lines into .NET strings, and the information about the file's character encoding is not preserved.
Similarly, file-writing cmdlets such as Out-File (and its effective alias, >) and Set-Content operate on .NET input strings and use a default encoding when saving to a file, though a different encoding may be requested via the -Encoding parameter.
In other words: If your input file had a consistent, non-default encoding that you want to preserve in the destination file, you (a) need to know what that encoding is and (b) request its use via -Encoding.
Note: Windows PowerShell defaults to the system's legacy ANSI code page for Get-Content and Set-Content, and to UTF-16LE ("Unicode") for Out-File / >. By contrast, PowerShell (Core) 7 now commendably uses (BOM-less) UTF-8, consistently across all cmdlets.
Additionally, because files are read line by line by Get-Content by default, information about the specific newline format is lost. The file-saving cmdlets then use the platform-native newline sequence (CRLF ("`r`n") on Windows, LF ("`n") on Unix-like platforms), so the destination file may end up with a different newline format. Also, the information as to whether a given file had a trailing newline is lost.
At the expense of having to read each file into memory in full (which normally isn't a problem with text files), you can preserve the original newline format and trailing-newline status by combining Get-Content -Raw with Set-Content -NoNewLine:
Get-Content -Raw *.csv -Exclude output.csv | Set-Content -Encoding utf8 -NoNewLine output.csv
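To see the difference between the two Get-Content variants for yourself, here is a small, self-contained sketch. The scratch directory, the file names (a.csv, b.csv, merged-lines.txt, merged-raw.txt) and the sample content are all made up for illustration. It only compares the resulting text, not the encoding on disk.

```powershell
# Scratch directory with a random name, so no real files are touched (illustrative only).
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null

# Two sample inputs with LF-only newlines; the second one has no trailing newline.
[System.IO.File]::WriteAllText((Join-Path $dir 'a.csv'), "id,name`n1,x`n")
[System.IO.File]::WriteAllText((Join-Path $dir 'b.csv'), "2,y`n3,z")

Push-Location $dir
try {
    # Line-by-line merge: every line is rewritten with the platform-native newline.
    Get-Content *.csv | Set-Content -Encoding utf8 merged-lines.txt

    # Raw merge: each file is passed through as one string, newlines untouched.
    Get-Content -Raw *.csv | Set-Content -Encoding utf8 -NoNewline merged-raw.txt
}
finally {
    Pop-Location
}

# Read both results back as text (a BOM, if one was written, is not part of the string).
$linesText = [System.IO.File]::ReadAllText((Join-Path $dir 'merged-lines.txt'))
$rawText   = [System.IO.File]::ReadAllText((Join-Path $dir 'merged-raw.txt'))

# Does the raw merge equal the plain concatenation of the two inputs?
$rawKeepsInput = $rawText -ceq "id,name`n1,x`n2,y`n3,z"
# Does the line-by-line merge end with the platform-native newline?
$linesEndWithNativeNewline = $linesText.EndsWith([Environment]::NewLine)

"Raw merge equals concatenated input:         $rawKeepsInput"
"Line-by-line merge ends with native newline: $linesEndWithNativeNewline"
```

The merged files use a .txt extension here only so that they cannot match *.csv; with a .csv destination you would need -Exclude as in the commands above.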
As for use cases:
You can use Get-Content with Set-Content for one or more of the following scenarios:
If your input files are text files that use varying character encodings (all of which Get-Content needs to be able to recognize), so as to create a consistently encoded destination file.
Similarly, even if the input files have the same encoding, you can choose to transcode the content, i.e. to choose a different encoding for the destination file.
If you want to normalize the newline format to the platform-native format and possibly also to ensure the existence of a trailing newline.
Otherwise, if the input files' content must be preserved as-is, which is especially true for binary files, use the cmd /c 'copy ...' approach. Solving this in PowerShell would require nontrivial use of lower-level .NET APIs.
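For illustration, here is a rough sketch of what such a .NET-based, byte-for-byte merge might look like. The function name Merge-FileBinary and its parameters are hypothetical (not part of PowerShell), and sorting the inputs by full path is an assumption made here to get a predictable order. It has not been tuned for very large inputs or unusual providers.

```powershell
# Hypothetical helper (not a built-in): concatenates files byte for byte,
# roughly like cmd.exe's copy /b, without decoding or re-encoding anything.
function Merge-FileBinary {
    param(
        [Parameter(Mandatory)] [string[]] $Path,        # wildcard patterns or paths
        [Parameter(Mandatory)] [string]   $Destination  # file to create or overwrite
    )

    # Resolve the destination to a full path, because .NET's working directory
    # usually differs from PowerShell's current location.
    $destFull = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Destination)

    # Resolve all inputs up front, skip directories and the destination itself,
    # and sort by full path (an assumption, to make the order predictable).
    $inputs = Get-Item -Path $Path |
        Where-Object { -not $_.PSIsContainer -and $_.FullName -ne $destFull } |
        Sort-Object FullName

    $out = [System.IO.File]::Create($destFull)   # creates or truncates the destination
    try {
        foreach ($file in $inputs) {
            $in = [System.IO.File]::OpenRead($file.FullName)
            try     { $in.CopyTo($out) }          # raw byte copy of the whole file
            finally { $in.Dispose() }
        }
    }
    finally {
        $out.Dispose()
    }
}
```

A call might then look like Merge-FileBinary -Path *.csv -Destination output.csv. Because the destination is excluded from the inputs and truncated first, a preexisting output.csv is overwritten rather than merged.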
CodePudding user response:
The copy command of cmd.exe concatenates files when you specify multiple source files and a single destination file just as you do, so the command line:
copy *.csv output.csv
should concatenate all files matching the source pattern *.csv in the order they are reported by the file system (hence in something like alphabetic order on NTFS). You should append the /B option at the end to mark the destination file as binary and thereby prevent an end-of-file character (0x1A) from being appended.
But as soon as output.csv exists, the result depends on the position at which it is returned by the search against the source mask *.csv:
output.csv is the first item: all the other files are appended to output.csv, provided that the overwrite prompt is confirmed or the /Y switch is added;
output.csv is not the first one: all the other files are concatenated into output.csv, and the original output.csv file seems to be ignored.
This has been tested on Windows 10, also considering the console output of copy, which lists all copied files, but excludes output.csv unless it is at the first position.
So to get a result independent of the file system, you could simply explicitly specify the first item (namely nul, the null device, providing no data) and use the + operator supported by copy:
copy /Y nul + *.csv output.csv /B
This ensures that a potentially already existing file output.csv is never the first one in the list of files to be copied, so its existing content is ignored for concatenation.