[Maia-users] Japanese
Andrew Hamilton
ahamilton at emissary.co.jp
Thu Aug 10 04:48:43 PDT 2006
> The only other snag that comes to mind is that all the applications that
> handle the email along the way need to be UTF-8 aware, too, or else what
> you're storing in the database may not be Unicode at all. There are
> probably implications for Postfix, Perl, amavisd-maia, SpamAssassin, and
> MySQL for starters. If they muck with the encoding of the original
> text, it gets harder to determine what the actual encoding may be, in
> order to re-encode it to UTF-8. This is the area I'm a bit confused
> about myself, and would like to resolve.
I can't tell you about any language besides Japanese, but I do know that it's extremely uncommon to see Unicode-encoded Japanese emails. Probably 95% of Japanese emails are encoded using ISO-2022-JP, which is a 7-bit-safe encoding mechanism that switches between Japanese and ASCII by using escape sequences. This is done so that MTAs and the like will leave the message alone. In fact, the SpamAssassin rules that deal specifically with Japanese text assume that the email will be encoded in ISO-2022-JP.
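To make the "7-bit-safe with escape sequences" point concrete, here is a minimal Python sketch. It assumes Python's built-in iso2022_jp codec; the sample string and the helper name is_seven_bit are made up for illustration and have nothing to do with Maia's own code.

```python
# Illustration only: the sample text and helper name are hypothetical.
ESC_TO_JIS = b"\x1b$B"    # escape sequence: switch to JIS X 0208 (Japanese)
ESC_TO_ASCII = b"\x1b(B"  # escape sequence: switch back to ASCII


def is_seven_bit(data: bytes) -> bool:
    """Return True if no byte has the high bit set."""
    return all(byte < 0x80 for byte in data)


text = "Hello, 日本語"
as_iso2022jp = text.encode("iso2022_jp")
as_utf8 = text.encode("utf-8")

# ISO-2022-JP stays within 7 bits and marks charset switches with ESC sequences.
print(is_seven_bit(as_iso2022jp), ESC_TO_JIS in as_iso2022jp)
# UTF-8 needs 8-bit bytes for the Japanese characters.
print(is_seven_bit(as_utf8))
```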
Typically the way that emails are encoded is:
The mail client that sends the mail transcodes from the System Encoding (these days, this is typically UTF-8) to the Transport Encoding (in the case of Japanese, this is ISO-2022-JP). The Transport Encoding is usually much more difficult to work with, computationally, but it has the advantage of being safe to pass through most mail servers. When the mail arrives at the destination mailbox, the recipient opens the mail, and converts from the Transport Encoding to the System Encoding for display.
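The two transcoding steps described above might look roughly like this in Python. This is only a sketch: the function names to_transport and to_system are hypothetical, and it assumes UTF-8 as the System Encoding and Python's iso2022_jp codec as the Transport Encoding.

```python
# Sketch of the sender-side and recipient-side transcoding steps.
# Function names and default encodings are assumptions for illustration.

def to_transport(system_bytes: bytes,
                 system_encoding: str = "utf-8",
                 transport_encoding: str = "iso2022_jp") -> bytes:
    """Sending client: System Encoding -> Transport Encoding."""
    return system_bytes.decode(system_encoding).encode(transport_encoding)


def to_system(transport_bytes: bytes,
              transport_encoding: str = "iso2022_jp",
              system_encoding: str = "utf-8") -> bytes:
    """Receiving client: Transport Encoding -> System Encoding for display."""
    return transport_bytes.decode(transport_encoding).encode(system_encoding)


composed = "明日の会議は10時からです。".encode("utf-8")  # what the sender typed
on_the_wire = to_transport(composed)                    # what the MTAs see
displayed = to_system(on_the_wire)                      # what the recipient sees
```

Everything between the two clients only ever handles on_the_wire, which is why the mail servers in the middle never see UTF-8.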
What this means in a practical sense, as it relates to Maia et al., is that any "mail handling" components should not expect the email to be in UTF-8 or any other sort of Unicode. So, a "correct" way of dealing with Japanese mail (instead of the hacky way that I probably will do it just to please our clients) would be something like:
Store emails in the database exactly the way they are when they pass through me (the mail pipeline). Then, *at display time*, examine the content-type and charset headers in the email to determine the encoding of the email. Use this information to convert on-the-fly to UTF-8 for display purposes only, leaving the original mail as-is. Then, if I have to send that email again (to rescue it out of a cache or whatever) I can just dump it back into the mail pipeline in its original encoding...
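A rough sketch of that store-raw, convert-at-display-time idea, using Python's standard email package. None of this is Maia's actual code: the function name render_for_display, the ASCII fallback, and the sample message are all assumptions, and it only looks at the first text/plain part.

```python
import email


def render_for_display(raw_message: bytes, fallback_charset: str = "ascii") -> str:
    """Decode the first text/plain part using its declared charset.

    The raw bytes are never modified; only the returned string is new.
    """
    msg = email.message_from_bytes(raw_message)
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        charset = part.get_content_charset() or fallback_charset
        body = part.get_payload(decode=True)  # undoes base64/quoted-printable
        try:
            return body.decode(charset, errors="replace")
        except LookupError:  # charset name Python does not know
            return body.decode(fallback_charset, errors="replace")
    return ""


# A made-up message as it might sit in the database, untouched.
stored = (
    b"From: sender@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=ISO-2022-JP\r\n"
    b"Content-Transfer-Encoding: 7bit\r\n"
    b"\r\n"
    + "こんにちは、世界".encode("iso2022_jp")
    + b"\r\n"
)

shown = render_for_display(stored)       # Unicode string for the web UI
page_bytes = shown.encode("utf-8")       # UTF-8 for display only
# To rescue the mail, re-inject `stored` itself, not `page_bytes`.
```

The point of the design is that the conversion is one-way and throwaway: the database copy stays in whatever encoding the sender chose, so nothing in the pipeline has to guess it back later.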
- Drew