John Levine of Standcore and the ICANN SSAC, who is also an Ambassador for the Universal Acceptance Steering Group (UASG), opened his talk by explaining that using languages other than English in email addresses is a new phenomenon.
Who is using EAI email?
Levine explained that EAI (Email Address Internationalization) is mainly used by literate computer users who cannot read the Latin characters used for English. He gave India as an example: in the state of Rajasthan, the Indian government is currently handing out email addresses in Hindi (the UASG has produced a case study on this topic).
Development of internationalized characters for email
He gave a brief history of non-English characters in email. At first there was only ASCII, a character set built around the English language. MIME then made it possible to use Unicode characters (a code space with room for over 1 million characters) in email bodies, but not in addresses.
Unicode code points vary widely in size (from 7 up to 21 bits) and are encoded as UTF-8, which uses 1 to 4 bytes per character. This, he explained, means that even if a Unicode string looks the same length as an ASCII string on screen, it may well take up more bytes when it is stored in the computer. To be prepared, you need to guard against buffer overflows and make sure your software can cope with strings that are longer than you are used to. He commented that this is not difficult, it just needs to be done.
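The difference between what a user sees and what is stored can be checked directly. The following is a minimal Python sketch, with made-up sample addresses, that compares the number of characters in an address with the number of bytes it occupies once encoded as UTF-8:

```python
def utf8_sizes(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


# Illustrative addresses only; escapes are used so the code points are unambiguous.
samples = [
    "bob@example.com",            # pure ASCII
    "b\u00f3b@example.com",       # o with acute accent (one code point)
    "\u7528\u6237@example.com",   # two Chinese characters in the local part
]

for address in samples:
    chars, octets = utf8_sizes(address)
    print(f"{address}: {chars} characters, {octets} bytes")
```

Any fixed-size buffer or length limit should be sized and checked in bytes rather than in visible characters.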
Another complication with Unicode is that there can be several ways to create the same character (e.g. an á can be encoded either as a character in its own right, or as an a followed by a combining accent). For human readers this makes no difference, but a computer sees two different sequences, which makes comparing strings difficult.
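Unicode normalization is the usual way to deal with this. As a small sketch using Python's standard `unicodedata` module (the helper name `same_text` is just for illustration), both spellings of á compare equal once they are normalized to the composed form (NFC):

```python
import unicodedata

precomposed = "\u00e1"   # a with acute accent as a single code point
combining = "a\u0301"    # a followed by COMBINING ACUTE ACCENT


def same_text(left: str, right: str) -> bool:
    """Compare two strings after normalizing both to NFC (composed form)."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


print(precomposed == combining)           # raw comparison of code points
print(same_text(precomposed, combining))  # comparison after normalization
```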
Internationalized Domain Names (IDNs)
IDNs are domain names that can contain Unicode characters. They can consist entirely of non-ASCII characters or, especially in Europe, a combination of ASCII and non-ASCII characters. Levine explained that adding Unicode characters directly to the domain name system would have created all sorts of problems, so a system was developed in which every internationalized domain name has two representations. One, the U-label (also known as “native label”), contains the Unicode characters and is meant to be read by humans. The other, the A-label, is meant to be read by computers and consists of a long string of ASCII characters starting with “xn--”.
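To illustrate the two representations, here is a minimal sketch using the `idna` codec that ships with Python. The helper names are hypothetical, and the built-in codec follows the older IDNA 2003 rules, so software that needs the current IDNA 2008 rules would probably use a dedicated library instead.

```python
def to_a_label(domain: str) -> str:
    """Convert a domain to its ASCII (A-label) form, e.g. for DNS lookups."""
    return domain.encode("idna").decode("ascii")


def to_u_label(domain: str) -> str:
    """Convert an A-label domain back to its Unicode (U-label) form for display."""
    return domain.encode("ascii").decode("idna")


u_label = "b\u00fccher.example"   # illustrative domain with a u-umlaut
a_label = to_a_label(u_label)
print(a_label)                    # ASCII form starting with "xn--"
print(to_u_label(a_label))        # back to the readable form
```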
So an IDN can provide the domain for an internationalized email address (the part after the @), but the challenge was then to allow Unicode characters in the local part (the part before the @). It took over a decade of experimentation to find a way to represent Unicode in the local part, i.e. the mailbox name.
Internationalized email addresses (EAI)
Levine explained that EAI functions like an overlay: a new, separate email system that runs in parallel to the existing one. He showed that what is critical is whether the recipient mailbox is EAI-ready. An ASCII sender can send to either an ASCII or an EAI recipient, but an EAI sender cannot necessarily send to an ASCII recipient. The rules concerning this are extremely complicated (however, if the sender knows that the ASCII recipient is hosted at Gmail or Hotmail, then they can rest assured that it will work fine).
As a result, EAI senders need to be prepared for their email to fail when they send to ASCII recipients. In this case, Levine recommended retrying every so often, because even if a mail system does not support EAI now, it may do so in the future. As an example, Hotmail and Outlook have only started supporting EAI in the past year, and Yahoo intends to begin supporting it in the future.
Implementation of EAI
John Levine explained that the changes required to make an email system EAI-ready are extremely simple. First, there is a new SMTP extension, SMTPUTF8, which the receiving server advertises to say that it accepts EAI mail. Second, the sending system adds an SMTPUTF8 parameter after the MAIL FROM address. This alerts the receiving mail system to check that the recipient has an EAI-compatible mailbox. The receiving system can still reject EAI mail for a specific recipient, even if the system as a whole accepts EAI, when that recipient's mailbox cannot handle it.
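The two steps can be sketched without touching the network. In the Python sketch below, all function names are hypothetical, the EHLO response lines are invented for illustration, and the decision is simplified to the envelope addresses only (it ignores message headers):

```python
def needs_smtputf8(*addresses: str) -> bool:
    """True if any envelope address contains a non-ASCII character."""
    return any(not address.isascii() for address in addresses)


def server_offers_smtputf8(ehlo_lines: list[str]) -> bool:
    """Look for the SMTPUTF8 keyword in EHLO reply lines such as '250-SMTPUTF8'."""
    for line in ehlo_lines:
        # Drop the 3-digit reply code and separator, keep the first keyword.
        keyword = line[4:].split(" ")[0].upper()
        if keyword == "SMTPUTF8":
            return True
    return False


def mail_from_command(sender: str, use_utf8: bool) -> str:
    """Build the MAIL FROM line, adding the SMTPUTF8 parameter when requested."""
    suffix = " SMTPUTF8" if use_utf8 else ""
    return f"MAIL FROM:<{sender}>{suffix}"


def plan_delivery(sender: str, recipient: str, ehlo_lines: list[str]) -> str:
    """Return the MAIL FROM line to send, or raise if EAI mail cannot be delivered."""
    if not needs_smtputf8(sender, recipient):
        return mail_from_command(sender, use_utf8=False)
    if server_offers_smtputf8(ehlo_lines):
        return mail_from_command(sender, use_utf8=True)
    raise ValueError("server does not advertise SMTPUTF8; report the failure or retry later")


# Invented EHLO response from a server that accepts EAI mail.
eai_server = ["250-mx.example.net", "250-8BITMIME", "250-SMTPUTF8", "250 SIZE 35882577"]
print(plan_delivery("b\u00f3b@example.com", "alice@example.net", eai_server))
```

With Python's standard `smtplib`, the corresponding check is `SMTP.has_extn("smtputf8")` after `ehlo()`, and the parameter is passed as `mail_options=["SMTPUTF8"]` to `sendmail()`.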
Advice for writing mail server software
Levine advised programmers, first of all, to make sure they do not send internationalized email to someone who cannot accept it. When delivery fails, do something sensible: provide a human-readable error message or, in the case of bulk mail, consider keeping a different version of the message for ASCII-only addresses. Further, both the U-label and A-label versions of domain names in email addresses should be accepted. Finally, do “fuzzy” matching on incoming addresses to deal with variations such as upper/lower case or missing accents, similar to the ways mail systems already deal with misspellings.
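One possible way to approach fuzzy matching is to reduce every incoming address to a comparison key. The Python sketch below is an assumption-laden illustration, not a recommended policy: the function names and the small `EXTRA_FOLDS` table are hypothetical, and stripping accents is not appropriate for every language. It folds case, removes combining accents, and treats the U-label and A-label forms of a domain as the same.

```python
import unicodedata

# Hypothetical table for letters that have no Unicode decomposition
# (o with stroke, l with stroke); a real system would need a fuller policy.
EXTRA_FOLDS = str.maketrans({"\u00f8": "o", "\u0142": "l"})


def fuzzy_key(local_part: str) -> str:
    """Fold case and strip accents so near-identical local parts compare equal."""
    folded = local_part.casefold().translate(EXTRA_FOLDS)
    decomposed = unicodedata.normalize("NFKD", folded)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def domain_key(domain: str) -> str:
    """Map U-label and A-label spellings of a domain to one lowercase ASCII form."""
    return domain.encode("idna").decode("ascii").lower()


def mailbox_key(address: str) -> tuple[str, str]:
    """Split at the last @ and return (fuzzy local part, canonical domain)."""
    local, _, domain = address.rpartition("@")
    return fuzzy_key(local), domain_key(domain)


for candidate in ["Bob@example.com", "b\u00f3b@EXAMPLE.com", "b\u00f8b@example.com"]:
    print(candidate, "->", mailbox_key(candidate))
```

Incoming mail for any of these spellings can then be looked up under the same key.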
What ESPs and email mailbox providers should consider
When assigning email addresses, ESPs should avoid addresses that people cannot type or that confuse users (for example, if they do not speak the language they are typing in). The Unicode Consortium and the IETF provide guidance on this. Furthermore, ESPs should avoid mailboxes with easily confused local parts (bob, bób, bøb): do not make these separate addresses, but treat them all as the same address. (On the other hand, when compiling lists of email addresses, keep in mind that if you come across bob, bób, and bøb, they are probably all the same address, with some of them perhaps misspellings.) Finally, most EAI users will also have an ASCII address, and you can ask whether they will give it to you.
For mailbox providers, Levine again emphasized the need for “fuzzy” matching – allowing mistakes with case, missing accents, or variant characters.
He added that it is not possible to downgrade an EAI message into ASCII without losing information, and it is not possible to respond to a downgraded message. Therefore it is better to put energy into making software EAI-capable than trying to invent non-EAI workarounds.
- Homographs: e.g. Latin O, Cyrillic O and Greek Omicron all look the same and are rendered identically in many fonts, even though they are different characters. If someone sends an email to B-Omicron-B, deliver it to Bob. But ESPs should not assign addresses like B-Omicron-B in the first place, because that is just going to confuse people.
- Two-way text: left-to-right vs right-to-left text flow. Avoid combining these within an email address (www and .sa going left-to-right, and Arabic text as the local part going right-to-left). Even the dot itself can be a left-to-right character.
- Avoid mixed scripts. In theory, an address could combine Chinese, Arabic, Cyrillic and other characters, but combining them is bad practice: the result is unreadable and impossible to type. While compatible scripts are OK (e.g. the three scripts used to write Japanese), mixed scripts should be treated very skeptically by spam filters.
- Variant characters (e.g. the simplified version of a Chinese character). Be aware of the problem: just like bøb above, the variant probably refers to the same person.
- Long domain names: there are top-level domain names as long as 24 characters.
- Several ways to write the same character (is it á or a + ́ ?). If it is possible to combine the elements into a single pre-defined character, it is better to do so.
- Punctuation is possible in local parts – it is allowable, but not advisable.
- It is technically legal to use an emoji in an email address. This should be avoided, because they are not easy to type. An email address must be easy to read and to type. Two different emojis with slightly different skin tones are not easy to differentiate or type.
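Some of the checks in this list, such as spotting homographs and mixed scripts, can be roughly automated. Python's standard library does not expose the Unicode Script property, so the sketch below makes a loud assumption: it uses the first word of each letter's Unicode character name as an approximate stand-in for its script. The function names and the `COMPATIBLE_SETS` table are hypothetical.

```python
import unicodedata

# Assumed sets of scripts that may legitimately appear together,
# here only the three scripts used to write Japanese.
COMPATIBLE_SETS = [{"HIRAGANA", "KATAKANA", "CJK"}]


def scripts_used(text: str) -> set[str]:
    """Approximate the scripts in text from the first word of each letter's name."""
    scripts = set()
    for ch in text:
        if unicodedata.category(ch).startswith("L"):   # letters only
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def is_suspicious(local_part: str) -> bool:
    """Flag local parts that mix scripts outside the known compatible sets."""
    scripts = scripts_used(local_part)
    if len(scripts) <= 1:
        return False
    return not any(scripts <= allowed for allowed in COMPATIBLE_SETS)


print(is_suspicious("bob"))                       # Latin only
print(is_suspicious("b\u03bfb"))                  # Latin b, Greek omicron, Latin b
print(is_suspicious("\u65e5\u672c\u306e\u30db"))  # kanji, hiragana and katakana
```

A provider could use such a check when assigning new addresses, and a spam filter could use it as one signal among others.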
John Levine finished by saying that EAI is on the way. It is going to be popular, particularly in countries like Thailand and India, where there is a literate population that does not read or write English. And finally, it is not difficult, but it is important to get ready.