mojibake

    [03:52:18] [17/29/69] (openmic): 今天是白茶。 - 大红袍缺货了,抱歉。
    [tob]      Oh now THAT is intersting.
    [tob]      The metadata.

The interest was probably on account of anonradio misbehaving, though the wacky metadata is more likely due to some track having data in encoding A being displayed in encoding B, unrelated to other problems with the system. Complex systems can exhibit multiple unrelated failures at the same time! Encoding problems are one of the few places where including a screenshot helps, as different viewers will see different things when their software assumes a different encoding than everyone else's.
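A quick way to reproduce the effect, assuming a UTF-8 terminal and locale (the sample text here is arbitrary), is to take UTF-8 bytes and run them through iconv as if they were ISO-8859-1:

    $ printf 'café\n' | iconv -f ISO-8859-1 -t UTF-8
    cafÃ©

The é was stored as the two bytes c3 a9; read as ISO-8859-1 those bytes are two characters, Ã and ©.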


Another problem is that copy and paste may mangle the data, e.g. some software will helpfully normalize Unicode or apply whatever other transformations it is wont to do; Microsoft Excel, for example, corrupts gene names by default. Sharing the data this way may change it, which complicates debugging, as what the debuggers are presented with may differ from the actual bytes of the original problem. Take screenshots, and try to save the original data unmolested somewhere.
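Unicode normalization is one such quiet transformation; a perl sketch (assuming a UTF-8 locale; Unicode::Normalize ships with perl) shows the same visible é occupying a different number of code points before and after NFD:

    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say length(NFC("é")), " vs ", length(NFD("é"))'
    1 vs 2

If a clipboard or editor applies such a normalization, the bytes handed to the debuggers are no longer the bytes that misbehaved.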

So how do we figure out what the original encoding was, and how it was being displayed? Brute forcing all the many encodings may take too much time, but one can probably make educated guesses. Unix software tends to use UTF-8 or ISO-8859-1, and this is a Unix system we are dealing with. Other operating systems have their own weird preferences for encoding, which is good information to include in a bug report. Another fact is that there is quite a lot of Japanese music in the anonradio "autodj" stream, so one hypothesis is that the text is being displayed as UTF-8 or ISO-8859-1 but is encoded with SHIFT-JIS or some other Japanese or other Asian encoding. This narrows the search list quite a bit.

    $ iconv -l | wc -l
         197
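Rather than brute forcing all 197, a handful of plausible candidates can be tried in a loop once the suspect bytes are saved to a file (blah, as captured in the next step); a sketch, with encoding names that may need adjusting to whatever iconv -l reports on a given system:

    $ for enc in SHIFT-JIS EUC-JP ISO-2022-JP GB18030 BIG5; do
    >     printf '== %s ==\n' "$enc"; iconv -f "$enc" -t UTF-8 < blah; done

Conversions that fail outright can be discarded; ones that produce plausible text are worth a closer look.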

A next step would be to (try to) isolate the original bytes, unmolested, into a file that can be poked at or viewed by different tools.
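How to isolate them depends on where the text lives; one sketch, assuming the garbled line is sitting in a chat or server log (the chatlog name and the openmic pattern are made up for illustration), is to write the matching line straight into a file rather than routing it through a clipboard:

    $ LC_ALL=C grep -a 'openmic' chatlog | tail -n 1 > blah

LC_ALL=C and -a keep grep from second-guessing bytes that are not valid in the current locale.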


    $ wc -c blah
         100 blah
    $ file blah
    blah: Non-ISO extended-ASCII text
    $ hexdump -C blah
    00000000 c3 a4 c2 bb c2 8a c3 a5  c2 a4 c2 a9 c3 a6 c2 98 ................
    00000010 c2 af c3 a7 c2 99 c2 bd  c3 a8 c2 8c c2 b6 c3 a3 ................
    00000020 c2 80 c2 82 20 2d 20 c3  a5 c2 a4 c2 a7 c3 a7 c2 .... - .........
    00000030 ba c2 a2 c3 a8 c2 a2 c2  8d c3 a7 c2 bc c2 ba c3 ................
    00000040 a8 c2 b4 c2 a7 c3 a4 c2  ba c2 86 c3 af c2 bc c2 ................
    00000050 8c c3 a6 c2 8a c2 b1 c3  a6 c2 ad c2 89 c3 a3 c2 ................
    00000060 80 c2 82 0a                                      ....
    00000064
    $ iconv -f SHIFT-JIS -t ISO-8859-1 < blah
    iconv: (stdin):1:0: cannot convert
    $ iconv -f SHIFT-JIS -t UTF-8 < blah
    テ、ツサツ甘・ツ、ツゥテヲツ伉ッテァツ卍スティツ個カテ」ツ
    iconv: (stdin):1:33: cannot convert

file(1) is probably guessing wrong here, though maybe this is some wacky encoding; many different things were done with the bytes outside the ASCII range, especially in the more Easterly portions of Europe: iron curtains, covert computer imports, isolated population groups, different languages.

Guessing SHIFT-JIS turns up Japanese characters, though iconv(1) gives up at some point. Also, the 」 may be missing a corresponding left corner bracket (such brackets are used to 「quote」 text in Japanese), so perhaps the tag is corrupt in some way unrelated to the mojibake problem? Or our conversion is wrong and produces the bracket accidentally, or some other corruption has left a stray bracket in the text, or several of these problems all together. Mojibake may be like a geologic layer that has been submerged, heated, folded, uplifted, and eroded: working back to the original may take a bit of effort.

    $ whatchar 「」「」
    [「] Ps U+300C LEFT CORNER BRACKET
    [」] Pe U+300D RIGHT CORNER BRACKET
    [「] Ps U+FF62 HALFWIDTH LEFT CORNER BRACKET
    [」] Pe U+FF63 HALFWIDTH RIGHT CORNER BRACKET
    $ printf '」' | iconv -f UTF-8 -t SHIFT-JIS | od -a
    0000000   a3                                                            
    0000001
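whatchar is a local tool; lacking it, a perl one-liner can report code points and names similarly (a sketch, assuming a UTF-8 terminal; charnames ships with perl):

    $ printf '「」' | perl -CS -Mcharnames=:full -nle 'printf "[%s] U+%04X %s\n", $_, ord, charnames::viacode(ord) for split //'
    [「] U+300C LEFT CORNER BRACKET
    [」] U+300D RIGHT CORNER BRACKET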

Since iconv goes off the rails at or after the 33rd character, we can test whether the conversion works at or after that point. Another approach would be to remove the problematic byte or bytes from a copy of the data, and to retry the iconv. Repeat until too many bytes are removed, or iconv coughs up something.

    $ perl -nle 'print $1 if m/^.{32}(.*)/' blah | iconv -f SHIFT-JIS -t UTF-8
    ツ
    iconv: (stdin):1:1: cannot convert
    $ perl -nle 'print $1 if m/^.{33}(.*)/' blah | iconv -f SHIFT-JIS -t UTF-8
    iconv: (stdin):1:0: cannot convert
    $ perl -nle 'print $1 if m/^.{34}(.*)/' blah | iconv -f SHIFT-JIS -t UTF-8
    ツ
    iconv: (stdin):1:1: cannot convert
    $ perl -nle 'print $1 if m/^.{35}(.*)/' blah | iconv -f SHIFT-JIS -t UTF-8
    iconv: (stdin):1:0: cannot convert
    $ perl -nle 'print $1 if m/^.{36}(.*)/' blah | iconv -f SHIFT-JIS -t UTF-8
     - テ・ツ、ツァテァツコツ「ティツ「ツ催ァツシツコティツエツァテ、ツコツ
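The removal approach mentioned above can be sketched with the four-argument form of substr in perl; the offset 35 here is only an example, and in practice one would walk it through the region where iconv gave up:

    $ perl -0777 -pe 'substr $_, 35, 1, ""' blah | iconv -f SHIFT-JIS -t UTF-8

-0777 slurps the whole file, so the offset is a plain byte offset rather than a position within some line.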

The fragments recovered above look like a (possibly corrupted somehow, maybe via double encoding?) song title or lyrics, though how the data might have been corrupted, and the details of the Japanese, are beyond my pay grade. A reverse approach may also be useful for debugging: take something properly encoded in UTF-8, convert it to SHIFT-JIS, then view those bytes through a UTF-8 or ISO-8859-1 lens:


    $ cat wa
    「は」ってなんですか
    $ od -h wa
    0000000     80e3    e38c    af81    80e3    e38d    a381    81e3    e3a6
    0000020     aa81    82e3    e393    a781    81e3    e399    8b81    000a
    0000037
    $ iconv -f UTF-8 -t SHIFT-JIS < wa > jis
    $ cat jis
    úvĂȂ
    $ file jis
    jis: Non-ISO extended-ASCII text

file(1) guesses the same non-ISO as above, so a good hypothesis is that the original data is in SHIFT-JIS, is possibly corrupted in some way, and is being run through a UTF-8 (or, untested, ISO-8859-1) lens. It may be helpful to have a terminal that uses various wacky encodings, so that one can quickly see what encoding A looks like when viewed as encoding B, and, if you are really into mojibake, to learn the low-level details of the encodings, so that what looks like "junk bytes" to me might have an easy explanation for you.
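Lacking such a terminal, an ISO-8859-1 lens can be faked on a UTF-8 terminal by upconverting the bytes, since every byte maps to some ISO-8859-1 code point (a sketch; control characters may still render oddly):

    $ iconv -f ISO-8859-1 -t UTF-8 < jis

And the double-encoding guess floated earlier is cheap to test: if bytes that were already UTF-8 had been read as ISO-8859-1 and re-encoded as UTF-8, then converting from UTF-8 back to ISO-8859-1 peels one layer off, and the result can be inspected again (a sketch; this is only one of several possible corruption paths):

    $ iconv -f UTF-8 -t ISO-8859-1 < blah > peeled
    $ file peeled
    $ iconv -f UTF-8 -t UTF-8 < peeled >/dev/null && echo looks like UTF-8

If the peeled bytes pass the UTF-8 round trip and display as something sensible, the double-encoding hypothesis has legs.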

