Command Line Magic

 


In this article I'll describe some Linux command-line tools for analyzing the character encoding of text files and converting them between different encodings. I'll also touch on the different newline conventions of the major operating systems. Finally, we'll have a look at some "forensics" on files with corrupted character encodings and how to repair them.

People who only work with English texts on their computer don't know how lucky they are: their language is fully covered by the limited range of unambiguous ASCII characters, and conversions of plain English texts between different character encodings (such as US-ASCII, ISO 8859-1, also known as Latin-1, or UTF-8) seldom cause any problems. Unfortunately, there is no universal gold standard for encodings. The closest thing to one, particularly for Western languages, is UTF-8. For various reasons, however, UTF-8 is not the best possible encoding for East Asian languages (Chinese, Japanese, Korean); many websites in these languages therefore still use legacy encodings, and the resulting mix of encodings makes them disproportionately prone to garbled output. For more information about the background and history of encodings, see this excellently written article. The most important bit is that "There Ain't No Such Thing As Plain Text", meaning that every text written on a computer depends on an encoding.

Under Linux the following command-line tools are useful when dealing with character encodings:

  • Determine file type and encoding: file
  • Display the individual characters and byte values of a file: hexdump
  • Convert between different character encodings: recode, iconv
  • Replace or delete individual characters: tr
  • Convert between newline encodings of different operating systems: dos2unix

 

Determine file type and encoding

We can use the command-line tool file to determine the file type and character encoding. A sane text file should reply to

file -bi textfile.txt

with something like this:

text/plain; charset=us-ascii

Other common encodings (or "charsets", as file calls them) are iso-8859-1 and utf-8. On the other hand, you have a problem if you get the following:

application/octet-stream; charset=binary

In this case, file couldn't figure out the character encoding of the text file and assumed that it is a binary file, i.e. not a text file at all. To investigate further, it is often helpful to open the file with a text editor. If only some of its contents are displayed garbled, then it makes sense to analyze the file at the byte level with tools such as hexdump. More on that later.
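
For comparison, a text file containing non-ASCII characters (say, German umlauts) that is correctly encoded as UTF-8 should produce output along these lines. The file name here is hypothetical, and the exact wording can vary between versions of file:

$ file -bi umlauts.txt
text/plain; charset=utf-8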

 

Convert between different character encodings

To convert text files from one encoding to another, we can use the recode command. For example, the following command will convert an ISO-8859-1 file to UTF-8:

recode ISO-8859-1..UTF-8 textfile.txt

Of course, you need to know beforehand that the input file is in fact encoded as ISO-8859-1. Also note that conversions from a larger character set to a smaller one (e.g. from UTF-8 to ISO-8859-1) will inevitably cause problems if the input file contains characters that are not part of the smaller character set. To see which encodings recode handles, you can display a list with 'recode -l'.
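
Also be aware that recode, when given a file name, converts the file in place. To keep the original untouched, you can run recode as a filter instead — a minimal sketch, with a hypothetical output file name:

recode ISO-8859-1..UTF-8 < textfile.txt > textfile-utf8.txt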

 

Validation of encodings

Although the primary purpose of iconv is to convert between different character encodings, it can also be used to validate the compliance of text files to a certain encoding. For example, we can verify whether a text file is encoded as UTF-8 with the following command:

iconv -f UTF-8 -t UTF-8 textfile.txt -o /dev/null ; echo $?

This command attempts to convert the input file from UTF-8 into UTF-8, discards the resulting output in the virtual trash bin /dev/null, and prints only iconv's exit status on the screen. An exit status of 0 stands for success, while 1 stands for error. Since the conversion only succeeds if the file is valid UTF-8 in the first place, an output of 0 means that the file is properly UTF-8 encoded.
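
For completeness, iconv can of course also perform the actual conversion, much like recode above. A sketch with a hypothetical output file name (the -o option is specific to GNU iconv; elsewhere you can redirect standard output instead):

iconv -f ISO-8859-1 -t UTF-8 textfile.txt -o textfile-utf8.txt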

 

Convert between newline encodings of different operating systems

As if different character encodings weren't bad enough, different operating systems also use different characters to mark new lines within text files. In Unix, Linux, and Mac OS X, new lines are encoded by the line feed character (\n). In Windows and DOS, new lines are encoded by a carriage return followed by a line feed (\r\n).

To determine the newline encoding of an unknown text file, we can use the hexdump utility. Let's say we have the following text file:

Hi,

You OK?

Depending on whether the file was created under Unix or Windows, hexdump will yield different output for the file's individual characters and their octal code values:

$ hexdump -bc unix.txt
0000000 110 151 054 012 012 131 157 165 040 117 113 077 012
0000000 H i , \n \n Y o u O K ? \n
000000d
$ hexdump -bc windows.txt
0000000 110 151 054 015 012 015 012 131 157 165 040 117 113 077 015 012
0000000 H i , \r \n \r \n Y o u O K ? \r \n
0000010
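
Incidentally, reasonably recent versions of file will also report the line terminators directly, which can save you the trip to hexdump; the output looks roughly like this (exact wording may vary):

$ file windows.txt
windows.txt: ASCII text, with CRLF line terminators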

If a text file created under Linux is opened with Notepad under Windows, the lines typically go on endlessly, because Notepad (at least in older versions) does not recognize the bare line feeds as line breaks. Linux applications, on the other hand, are much better at recognizing and handling different newline encodings, so under Linux you are often not even aware of which newline encoding you are currently working with.

Converting newline encodings from Windows to Linux is very simple with the dos2unix utility:

dos2unix windows.txt

The same package also provides the self-explanatory unix2dos command for the opposite conversion.
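
Both commands convert the file in place by default. If you would rather keep the original, newer versions of dos2unix offer a new-file mode via the -n option; the output file names below are just examples:

dos2unix -n windows.txt unix.txt
unix2dos -n unix.txt windows.txt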

 

Forensics: how to repair corrupted encodings

Let's look at a real-life example I encountered when publishing my late father's undergraduate Analysis lectures as a book. The lecture notes were written between 1992 and 2003 as LaTeX files in various proprietary editors under MS-DOS and Windows 9x, and contained German special characters (the umlauts ä, Ä, ö, Ö, ü, Ü, and the sharp s ß). The files were later burned to DVD and eventually transferred from there to Linux. At some point in this process, the encoding of the special characters got badly corrupted.

Opening the files with Emacs, I immediately noticed that the German special characters were not displayed properly. Using hexdump, I established which character codes (in octal) appeared in place of the correct letters. The table below shows the correct octal values in ISO-8859-1 (Latin-1) and UTF-8 for each special character, along with the code values that appeared in the corrupted files instead. For example, the letter 'ä' would be incorrectly represented as \204 in some places and as \342 \200 \236 in others.

letter    ISO-8859-1    UTF-8          corrupted code values

ü         \374          \303 \274      \201
Ü         \334          \303 \234      \232
ä         \344          \303 \244      \204 or \342 \200 \236
Ä         \304          \303 \204      \216
ö         \366          \303 \266      \224 or \342 \200 \235
ß         \337          \303 \237      \341

Luckily, as a glance at an ISO-8859-1 code chart confirms, none of the wayward code values in the rightmost column corresponds to a character that legitimately occurs in German text. This allows us to replace all occurrences of the corrupted code values with the correct ones, making batch processing feasible.
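
Before changing anything, it can be reassuring to count how often the corrupted byte sequences actually occur. A quick sketch, assuming a bash-like shell (where $'...' expands octal escapes to raw bytes) and forcing the C locale so that grep treats the input as plain bytes:

LC_ALL=C grep -c $'\204' textfile.txt            # lines containing the corrupted 'ä' byte
LC_ALL=C grep -c $'\342\200\236' textfile.txt    # lines containing the three-byte variant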

In my case, I first used sed to convert the corrupted code values into their correct ASCII values. For example, the command to repair the corrupted occurrences of 'ä' would be:

sed -i 's/\o204/\o344/g; s/\o342\o200\o236/\o344/g' textfile.txt

Why use the Swiss army knife sed instead of the more specialized tr tool? Although tr has a very clear syntax, it is somewhat limited in its capabilities: it can only replace or delete single code values, not multi-byte sequences such as \342 \200 \236. For the record, the tr command to perform the first of the two replacement operations done with sed above is:

tr "\204" "\344" <textfile.txt >outputfile.txt

After performing this conversion for all German special characters and deleting a stray invalid character that for some reason was present at the very end of the text file, my file conformed to the ISO-8859-1 encoding standard:

$ file -bi textfile.txt
text/x-tex; charset=iso-8859-1

From here on it was smooth sailing. Using recode, I converted my file to the "gold standard" UTF-8:

recode ISO-8859-1..UTF-8 textfile.txt

Lastly, I verified that the conversion was successful:

$ iconv -f UTF-8 -t UTF-8 textfile.txt -o /dev/null ; echo $?
0

With simple batch processing I repaired the corrupted encoding in several hundred pages of text and converted them to valid UTF-8!