Problem Background
We have several thousand large text files (more than 10M lines each) of tabular data produced by a Windows machine, which we need to prepare for upload to a database.
We need to change the file encoding of these files from cp1252 to utf-8, replace any bare Unix LF sequences (i.e. \n) with spaces, then replace the DOS line-end sequences ("CR-LF", i.e. \r\n) with Unix line-end sequences (i.e. \n).
The dos2unix utility is not available for this task.
We initially had a bash function that packaged these operations together using iconv and sed, with iconv doing the encoding and sed dealing with the LF/CRLF sequences. I'm trying to replace part of this bash function with a perl command.
Example Code
Based on some helpful code review, I want to change this function to a perl script.
The author of the code review suggested the following perl to replace CRLF (i.e. "\r\n") with LF ("\n").
perl -g -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;'
The explanation for why this is better than what we had previously makes perfect sense, but this line fails for me with:
Unrecognized switch: -g (-h will show valid options).
More interestingly, the author of the code review also suggests it is possible to perform the decode/recode in a perl script, too, but I am completely unsure where to start.
Questions
Please can someone explain why the suggested answer fails with Unrecognized switch: -g (-h will show valid options)?
If it helps, the line is supposed to receive piped input from iconv as follows (though I am interested in learning how to use perl to do the decoding/recoding step, too):
iconv --from-code=CP1252 --to-code=UTF-8 "$1" | \
perl -g -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;' \
> "$2"
(Highly simplified) example input for testing:
apple|orange|\n|lemon\r\nrasperry|strawberry|mango|\n\r\n
Desired output:
apple|orange| |lemon\nrasperry|strawberry|mango| \n
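For testing, the sample above has to be turned into real bytes, since the \n and \r in it stand for control characters rather than literal backslashes. A minimal sketch using printf follows; the function names make_sample and make_expected are hypothetical, chosen only for this illustration.

```shell
# Emit the sample input as real bytes: printf expands \n to LF and \r to CR.
make_sample() {
    printf 'apple|orange|\n|lemon\r\nrasperry|strawberry|mango|\n\r\n'
}

# Emit the desired output the same way, so results can be compared byte for byte.
make_expected() {
    printf 'apple|orange| |lemon\nrasperry|strawberry|mango| \n'
}

# To inspect the control characters explicitly:
#   make_sample | od -c
```

Any candidate command can then be checked with something like `make_sample | candidate | od -c` and compared against `make_expected | od -c`.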
CodePudding user response:
Perl added the command line switch -g as an alias for -0777 ('gulp mode', i.e. read the whole input as one string) in Perl v5.36.0.
This works in Perl version v5.36.0:
s=$(printf "Line 1\nStill Line 1\r\nLine 2\r\nLine 3\r\n")
perl -g -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;' <<<"$s"
Prints:
Line 1 Still Line 1
Line 2
Line 3
But with any version of perl earlier than v5.36.0, you would do:
perl -0777 -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;' <<<"$s"
# same
BTW, the conversion you are looking for is way easier in this case with awk, since it is close to the defaults.
Just do this:
awk -v RS="\r\n" '{gsub(/\n/," ")} 1' <<<"$s"
Line 1 Still Line 1
Line 2
Line 3
Or, if you have a file:
awk -v RS="\r\n" '{gsub(/\n/," ")} 1' file
This is superior to the posted perl solution since the file is processed record by record (each block of text separated by \r\n) versus having to read the entire file into memory.
(On Windows you may need to do awk -v RS="\r\n" -v OFS="\n" '...')
Another note:
You can get similar behavior from Perl by:
- Setting the input record separator to the fixed string $/="\r\n" in a BEGIN block;
- Using the -l switch so every line has the input record separator removed;
- Using tr for speedy replacement of \n with ' '.
Full command:
perl -lpE 'BEGIN{$/="\r\n"} tr/\n/ /' file
CodePudding user response:
The error message is about the command line switch -g you use in perl -g -pe .... This is not about the /g modifier on the regex, which is valid (but useless when -p reads line by line without slurping, since there is then only a single \n in each line).
This switch simply does not exist in the perl version you are using. It was only added in perl 5.36, so you are likely using an older version. Try -0777 instead.