I'm currently dealing with a bunch of old files that have seen a lot of machines, OSes, and file systems during their lifetime. A couple of them contain German umlauts (ä, ö, ü), and apparently these caused some of the filenames to break during one of the moves. A file originally named
München.txt
appears as
M?nchen.txt (invalid encoding)
on the Ubuntu system where they are currently hosted.
So now I'm trying to bulk repair them. On looping through the files with the initial draft, I stumbled across this phenomenon:
Echoing to the screen gives me the filename with the question mark, which I understand is how an illegal character within the filename gets rendered:
./list_files.sh path_to_files
M?nchen.txt
K?ln.txt
If however I save the output to a file, it will give me a binary file that still contains the invalid characters:
./list_files.sh path_to_files > file_list
less file_list
M<FC>nchen.txt
K<F6>ln.txt
This is the code:
#!/bin/bash
rootdir=$1
find "$rootdir" -print0 | while IFS= read -r -d '' broken_file_name; do
    # Quoted so the name is not word-split or glob-expanded
    echo "$broken_file_name"
done
I'm trying to understand:
- Why is the screen output different from the one in the file? Where does the character replacement happen and where is the question-mark-thing created?
- How can I prevent the interpretation of illegal characters with the question-mark-thing within the process of the script? It prevents me from selectively replacing an illegal character with the corresponding correct one.
CodePudding user response:
The question mark replacement is unlikely to happen in Bash itself: the redirected output in your file_list shows that echo passes the bytes through unchanged. More likely it is the terminal that substitutes a placeholder when it receives bytes that are not valid in the encoding it expects (presumably UTF-8).
We can only speculate about the original encoding, but the symptoms are consistent with Latin-1 (ISO-8859-1).
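One way to test the Latin-1 guess is to look at the raw bytes and convert a single name by hand. The following is a minimal sketch, assuming `od` and `iconv` are available (as they are on a stock Ubuntu); the sample name is not read from disk but rebuilt with printf from the `<FC>` byte that less displayed.

```bash
#!/bin/bash
# Rebuild a sample name containing the raw byte 0xFC (octal 374).
sample=$(printf 'M\374nchen.txt')

# Dump the bytes in hex. A lone "fc" is not valid UTF-8,
# but it is the code for "ü" in ISO-8859-1.
printf '%s' "$sample" | od -An -tx1

# Convert under the Latin-1 assumption. In a UTF-8 terminal the
# result should display with a proper umlaut.
converted=$(printf '%s' "$sample" | iconv -f ISO-8859-1 -t UTF-8)
printf '%s\n' "$converted"
```

If the converted name looks right, the Latin-1 assumption is at least plausible for that file; it does not prove that every broken name uses the same encoding.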
Assuming I guessed the encoding correctly, and assuming your current locale is a UTF-8 one, try something like
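while IFS= read -r original; do
    # Reinterpret the name's bytes as Latin-1 and convert to the locale's encoding
    dest=$(iconv -f iso-8859-1 <<<"$original")
    # Skip unchanged (pure ASCII) names, which mv would reject
    [ "$dest" = "$original" ] || mv -- "$original" "$dest"
done <file_list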
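A list file breaks down if a name contains a newline or if a directory gets renamed before its contents. A possible variant that avoids both is sketched below. It is only a sketch under the same Latin-1 assumption: `fix_names` is a hypothetical helper name, and `mv -n` (no overwrite) is assumed to be supported, as it is in GNU coreutils.

```bash
#!/bin/bash
# Hypothetical helper: convert Latin-1 file names below a directory to UTF-8.
# Assumption: every name that is not valid UTF-8 is ISO-8859-1.
fix_names() {
    local rootdir=$1 path dir base fixed
    # -depth lists a directory's contents before the directory itself,
    # so renaming a directory never invalidates paths still to come.
    find "$rootdir" -depth -print0 | while IFS= read -r -d '' path; do
        dir=$(dirname -- "$path")
        base=$(basename -- "$path")
        # Skip names that are already valid UTF-8 (this includes plain ASCII).
        if printf '%s' "$base" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            continue
        fi
        # Convert only the last path component; skip the entry if iconv fails.
        fixed=$(printf '%s' "$base" | iconv -f ISO-8859-1 -t UTF-8) || continue
        # -n refuses to overwrite an existing file of the same name.
        mv -n -- "$path" "$dir/$fixed"
    done
}
```

It would be called as `fix_names path_to_files`. Trying it on a copy of the data first seems prudent, since a wrong encoding guess produces names that are valid UTF-8 but still wrong.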
CodePudding user response:
The different behavior with less is probably a less thing. From the manual:
Control and binary characters are displayed in standout (reverse video). Each such character is displayed in caret notation if possible (e.g. ^A for control-A). Caret notation is used only if inverting the 0100 bit results in a normal printable character. Otherwise, the character is displayed as a hex number in angle brackets. This format can be changed by setting the LESSBINFMT environment variable.
But since what you want is to rename your files, the way the names are displayed by various utilities is not that important. In your script you can use, e.g., tr to compute the new name by replacing the characters you do not like with others. For example, to replace ö and ü with o and u, respectively:
new=$(tr '\366\374' 'ou' <<< "$old")
# Rename only if tr actually changed something
if [ "$new" != "$old" ]; then
    mv -- "$old" "$new"
fi
(366 and 374 are the octal codes of ö and ü in ISO-8859-1, not ASCII; they correspond to the <F6> and <FC> bytes shown by less.)
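The octal values can be derived from the hex bytes that less printed. A small sketch using Bash's printf builtin, which accepts `0x` constants for numeric formats (the variable names are just for illustration):

```bash
#!/bin/bash
# Convert the hex byte values shown by less (<F6> and <FC>) to octal,
# the notation that tr expects in its escape sequences.
oct_o=$(printf '%o' 0xF6)
oct_u=$(printf '%o' 0xFC)
printf '%s %s\n' "$oct_o" "$oct_u"   # prints: 366 374
```

The same conversion works for any other byte that less shows in angle brackets, for example `<E4>` for ä.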