Shell-script-internal encoding differs from redirected output

Time:11-29

Currently I'm dealing with a bunch of old files that have seen a lot of machines, OSes and file systems during their lifetime. A couple of them contain German umlauts (ä, ö, ü), and apparently these caused some of the filenames to break during one of the moves. A file originally named

München.txt

appears as

M?nchen.txt (invalid encoding)

on the Ubuntu system where they are currently hosted.

So now I'm trying to bulk repair them. While looping over the files with a first draft of the script, I stumbled across this behavior:

  • Echoing to the screen gives me the filename with a question mark, which I understand indicates a byte that is invalid in the current encoding:

     ./list_files.sh path_to_files
    
     M?nchen.txt
     K?ln.txt
    
  • If however I save the output to a file, it will give me a binary file that still contains the invalid characters:

     ./list_files.sh path_to_files > file_list
    
     less file_list
     M<FC>nchen.txt
     K<F6>ln.txt
    

This is the code:

#!/bin/bash

rootdir=$1

find "$rootdir" -print0 | while IFS= read -r -d '' broken_file_name; do
    echo "$broken_file_name"
done

I'm trying to understand:

  1. Why is the screen output different from the one in the file? Where does the character replacement happen and where is the question-mark-thing created?
  2. How can I prevent the interpretation of illegal characters with the question-mark-thing within the process of the script? It prevents me from selectively replacing an illegal character with the corresponding correct one.

CodePudding user response:

The question mark is most likely produced when the output is displayed rather than by the script: the redirected file still contains the raw bytes (FC, F6), so echo evidently writes them through unchanged. A terminal that expects UTF-8 cannot decode those bytes and shows a placeholder such as ? instead.

We can only speculate about the original encoding, but the symptoms are consistent with Latin-1 (ISO-8859-1).
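
To check which bytes a name really contains, independent of how a terminal or pager renders them, a hex dump helps. This is a minimal sketch using a hard-coded sample name rather than a real file; the helper name show_bytes is made up for illustration.

```shell
#!/bin/bash
# show_bytes is a hypothetical helper name, not part of any tool.
# It prints the bytes of its argument as space-separated hex values.
show_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# Sample name with a Latin-1 encoded ü (byte 0xFC, octal 374).
show_bytes $'M\374nchen.txt'
echo
# prints: 4d fc 6e 63 68 65 6e 2e 74 78 74
```

A lone fc byte where ü should be fits Latin-1; in UTF-8 the same letter would appear as the two bytes c3 bc.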

Assuming I guessed the encoding correctly, and assuming your current locale is a UTF-8 one, try something like

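while IFS= read -r original; do
    # Reinterpret the name as Latin-1 (a guess) and convert it to UTF-8.
    dest=$(iconv -f iso-8859-1 -t utf-8 <<<"$original")
    # Names that are pure ASCII come out unchanged; skip those.
    [ "$dest" = "$original" ] || mv -- "$original" "$dest"
done <file_list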
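
Before renaming anything, it may be worth previewing the conversion and leaving alone any name that is already valid UTF-8, since converting such a name from Latin-1 would garble it a second time. The following is only a sketch under the same assumption (broken names are ISO-8859-1, the target is UTF-8); is_valid_utf8 and to_utf8 are hypothetical helper names, and the sample name is hard-coded.

```shell
#!/bin/bash
# Both function names are hypothetical helpers for this sketch.

# Succeeds if the string is already valid UTF-8
# (iconv exits non-zero on an invalid input sequence).
is_valid_utf8() {
    printf '%s' "$1" | iconv -f utf-8 -t utf-8 >/dev/null 2>&1
}

# Reinterpret the bytes as ISO-8859-1 (an assumption) and emit UTF-8.
to_utf8() {
    printf '%s' "$1" | iconv -f iso-8859-1 -t utf-8
}

name=$'M\374nchen.txt'   # sample with a Latin-1 ü (byte 0xFC)
if is_valid_utf8 "$name"; then
    printf 'unchanged: %s\n' "$name"
else
    printf 'would rename to: %s\n' "$(to_utf8 "$name")"
fi
```

Swapping the printf in the else branch for an mv -- call would turn the preview into the actual rename.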

CodePudding user response:

The different behavior with less is probably a less thing. From the manual:

Control and binary characters are displayed in standout (reverse video). Each such character is displayed in caret notation if possible (e.g. ^A for control-A). Caret notation is used only if inverting the 0100 bit results in a normal printable character. Otherwise, the character is displayed as a hex number in angle brackets. This format can be changed by setting the LESSBINFMT environment variable.

But since what you want is to rename your files, the way the names are displayed by various utilities is not that important. In your script you can use, e.g., tr to compute the new name by replacing the bytes you do not want with others. For example, to replace ö and ü with o and u, respectively:

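# Map the Latin-1 bytes for ö (octal 366) and ü (octal 374) to plain ASCII.
# LC_ALL=C makes tr treat its input strictly as single bytes.
new=$(LC_ALL=C tr '\366\374' 'ou' <<< "$old")
if [ "$new" != "$old" ]; then
  mv -- "$old" "$new"
fi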

(366 and 374 are the octal values of the ISO-8859-1 bytes for ö and ü, i.e. 0xF6 and 0xFC. They lie outside the 7-bit ASCII range.)
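
The same byte-for-byte mapping can be extended to the other German letters. This is a sketch that again assumes the names are ISO-8859-1; ascii_name is a hypothetical helper name. Because tr maps one byte to one byte, ß becomes a single s here rather than ss.

```shell
#!/bin/bash
# ascii_name is a hypothetical helper name for this sketch.
# Octal Latin-1 bytes: ä=344 ö=366 ü=374 Ä=304 Ö=326 Ü=334 ß=337
ascii_name() {
    printf '%s' "$1" | LC_ALL=C tr '\344\366\374\304\326\334\337' 'aouAOUs'
}

ascii_name $'K\366ln.txt'
echo
# prints: Koln.txt
```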
