04 Apr, 2011, Scandum wrote in the 1st comment:
Votes: 0
It has previously been discussed to use UTF-8 in combination with a custom true-type font. To aid in the adoption of UTF-8 some centralization of knowledge and working practices would be helpful.

As KaVir pointed out recently, UTF-8 support can be detected as follows for Mushclient.

Quote
Server: IAC DO CHARSET
Client: IAC WILL CHARSET
Server: IAC SB CHARSET REQUEST SEPARATOR "UTF-8" IAC SE

Then it responds with either:

Client: IAC SB ACCEPTED "UTF-8" IAC SE

Or:

Client: IAC SB REJECTED IAC SE


I haven't read up enough on the telopt to be sure whether this is good practice, or whether it works on a UTF-8 capable terminal emulator using telnet.
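
For reference, the exchange above can be sketched in Python as follows. The byte values are the ones defined in RFC 854 and RFC 2066; the function names `build_charset_request` and `parse_charset_reply` are hypothetical, and the sketch uses the corrected `IAC SB CHARSET ACCEPTED ...` form of the reply. It ignores TTABLE and assumes the whole subnegotiation arrives in one piece.

```python
# Telnet command bytes (RFC 854) and CHARSET option codes (RFC 2066).
IAC, SB, SE = 255, 250, 240
WILL, DO = 251, 253
CHARSET = 42
REQUEST, ACCEPTED, REJECTED = 1, 2, 3

# Server opens with IAC DO CHARSET; a willing client answers IAC WILL CHARSET.
DO_CHARSET = bytes([IAC, DO, CHARSET])
WILL_CHARSET = bytes([IAC, WILL, CHARSET])


def build_charset_request(charsets, sep=b" "):
    """Build IAC SB CHARSET REQUEST <sep><name>[<sep><name>...] IAC SE."""
    body = b"".join(sep + name.encode("ascii") for name in charsets)
    return bytes([IAC, SB, CHARSET, REQUEST]) + body + bytes([IAC, SE])


def parse_charset_reply(data):
    """Return the accepted charset name, or None if rejected or malformed."""
    head = bytes([IAC, SB, CHARSET])
    tail = bytes([IAC, SE])
    if not (data.startswith(head) and data.endswith(tail)):
        return None
    payload = data[len(head):-len(tail)]
    if payload[:1] == bytes([ACCEPTED]):
        return payload[1:].decode("ascii", errors="replace")
    return None  # REJECTED, or something this sketch does not handle


request = build_charset_request(["UTF-8"])
accepted = bytes([IAC, SB, CHARSET, ACCEPTED]) + b"UTF-8" + bytes([IAC, SE])
rejected = bytes([IAC, SB, CHARSET, REJECTED, IAC, SE])
```

A server would send `DO_CHARSET`, wait for `WILL_CHARSET`, send `request`, and then feed the client's subnegotiation to `parse_charset_reply`.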

As mentioned elsewhere UTF-8 support can be confirmed using MTTS or MSDP's "UTF-8" configurable variable.

To actually switch a terminal in and out of UTF-8 the following escape sequences should be used:

\e%@    Select default (ISO 646 / ISO 8859-1)
\e%G    Select UTF-8


Some terminals may not display UTF-8 until it's explicitly enabled.
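
A minimal Python sketch of those two sequences, assuming the server writes raw bytes to the connection. The constant names and the `wrap_utf8` helper are hypothetical; whether to switch once per session or around each UTF-8 string is a design choice this sketch leaves open.

```python
ESC = b"\x1b"
SELECT_DEFAULT = ESC + b"%@"  # back to the default (ISO 646 / ISO 8859-1)
SELECT_UTF8 = ESC + b"%G"     # switch the terminal to UTF-8


def wrap_utf8(text):
    """Encode text as UTF-8 and bracket it with the switch sequences."""
    return SELECT_UTF8 + text.encode("utf-8") + SELECT_DEFAULT


payload = wrap_utf8("caf\u00e9")
```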
04 Apr, 2011, Kaz wrote in the 2nd comment:
Votes: 0
Scandum said:
Then it responds with either:

Client: IAC SB ACCEPTED "UTF-8" IAC SE

Or:

Client: IAC SB REJECTED IAC SE


Hopefully, it responds with IAC SB CHARSET ACCEPTED … or IAC SB CHARSET REJECTED … instead?
04 Apr, 2011, KaVir wrote in the 3rd comment:
Votes: 0
Kaz said:
Hopefully, it responds with IAC SB CHARSET ACCEPTED … or IAC SB CHARSET REJECTED … instead?

Yeah I typo'd - can't change it now, either.

By the way, CHARSET is covered by RFC 2066.
04 Apr, 2011, Scandum wrote in the 4th comment:
Votes: 0
Too late for me to edit as well.

Does anyone know if the CHARSET telopt is something that only works with Mushclient?
04 Apr, 2011, KaVir wrote in the 5th comment:
Votes: 0
Scandum said:
Does anyone know if the CHARSET telopt is something that only works with Mushclient?

BlowTorch is planning to add it, but I don't know of any other clients.
04 Apr, 2011, Stendec wrote in the 6th comment:
Votes: 0
The CHARSET option is supported by DecafMUD. It doesn't do any of the weird TTABLE stuff, of course, and doesn't have too many different character sets it supports (by default, since you can write plugins to add them). Of course, it's got UTF-8 and that's all that really matters.
05 Apr, 2011, Scandum wrote in the 7th comment:
Votes: 0
What would be the standard character set? ISO 8859-1?

I assume BIG 5 would be the 3rd most common one, though would it be requested as BIG5, BIG 5, or BIG-5?
05 Apr, 2011, quixadhal wrote in the 8th comment:
Votes: 0
I would guess ISO 646 (ASCII) is the safest default. ISO 8859-1 would probably be the second choice, and Windows-1252 would probably be the third most likely. Then you'd move into the language-specific ones. Big 5 is Taiwan (Microsoft CP 950); has it superseded the Chinese and Korean sets?

Remember, what the "standard" is depends on your target audience.
05 Apr, 2011, Scandum wrote in the 9th comment:
Votes: 0
Big 5 is supported by TinTin++, uses two-byte sequences, and is very similar to two-byte UTF-8. Most of the East Asian traffic for tt++ comes from Taiwan, South Korea, and Hong Kong. I think the Chinese GB 2312 encoding is compatible with ASCII as well, but as Chinese users still have to operate the client in English, the language barrier might be the bigger issue. There are probably Chinese mud clients by now as well.

So if I get this straight UTF-8 and BIG-5 are fully compatible with ISO 646 (which is 7 bit ASCII). So a client can always accept ISO 646. If a client is set to UTF-8 or BIG-5 compatibility it can accept UTF-8 or BIG-5, but not ISO 8859-1.

If the client is set to 8 bit ASCII it should reject UTF-8 and BIG-5.
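
The compatibility claim can be checked with Python's built-in codecs. This is only a sketch: `is_valid` and `client_accepts` are hypothetical helpers, and `client_accepts` encodes the rule as stated above (ISO 646 is always acceptable, anything else only if it matches the client's configured mode), which is an assumption rather than anything RFC 2066 requires.

```python
def is_valid(data, encoding):
    """True if the byte string decodes cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def client_accepts(client_mode, requested):
    """Accept ISO 646 always; otherwise only the client's own mode."""
    return requested == "ISO 646" or requested == client_mode


# Pure 7 bit text is byte-for-byte identical in all three encodings.
text = "You see a goblin."
same_bytes = text.encode("ascii") == text.encode("utf-8") == text.encode("big5")

# An 8 bit ISO 8859-1 character is not valid UTF-8 on its own.
latin1 = "caf\u00e9".encode("iso-8859-1")  # ends in the single byte 0xE9
latin1_is_utf8 = is_valid(latin1, "utf-8")
```

Note that Big5 trail bytes can fall in the 0x40-0x7E range, so a byte-wise scan for ASCII characters can still misfire inside a two-byte sequence even though plain ASCII text passes through unchanged.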


Even if the client handles UTF-8 internally it may not be using a UTF-8 compatible terminal or a Unicode font. As far as I know that issue hasn't been resolved? One workaround would be using "UNICODE" = "1" as an MSDP variable to indicate the user/client thinks a Unicode font is available, with CHARSET to determine if the client can deal with UTF-8 encoding. The distinction will probably be confusing to most people.

Another issue is standardization. Are "UTF-8", "ISO 646", "ISO 8859-1", "Windows-1252", and "BIG-5" what other systems are using, or have they settled for other variants?
05 Apr, 2011, KaVir wrote in the 10th comment:
Votes: 0
Scandum said:
Even if the client handles UTF-8 internally it may not be using a UTF-8 compatible terminal or a Unicode font. As far as I know that issue hasn't been resolved? One workaround would be using "UNICODE" = "1" as an MSDP variable to indicate the user/client thinks a Unicode font is available, with CHARSET to determine if the client can deal with UTF-8 encoding. The distinction will probably be confusing to most people.

I can't think of any useful reason for using UTF-8 without a Unicode font. I'd therefore suggest that the existing "UTF-8" MSDP variable is sufficient - the client should only set it if it's actually able to display Unicode characters. This is also the assumption made by my snippet.

Of course there's no way to know which characters a particular Unicode font supports, either. But unless you know of some reliable way to identify the font, I don't think there's much we can do about that.
05 Apr, 2011, Scandum wrote in the 11th comment:
Votes: 0
I can't really think of a reason either. In TinTin++ UTF-8 is a configurable option, so I'll go ahead and implement CHARSET support as a way to detect the client configuration. CHARSET won't be able to change the encoding handling automatically.
05 Apr, 2011, Scandum wrote in the 12th comment:
Votes: 0
I just looked up the internal variable names on Linux, they are:

"UTF-8", "ISO-8859-1", "BIG5", "CP1252", "ANSI_X3.4-1968"

tt++ will probably respond to UTF-8 and BIG5.
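
Since the spelling of a requested name may vary (BIG5, BIG 5, BIG-5), one possible approach is to normalize names before comparing them. The sketch below maps the variants mentioned in this thread onto the Linux names listed above; the `CANONICAL` table and `canonical_charset` function are hypothetical and far from exhaustive.

```python
import re

# Keys are names with everything except letters and digits removed.
CANONICAL = {
    "UTF8": "UTF-8",
    "ISO88591": "ISO-8859-1",
    "BIG5": "BIG5",
    "CP1252": "CP1252",
    "WINDOWS1252": "CP1252",
    "ISO646": "ANSI_X3.4-1968",
    "ASCII": "ANSI_X3.4-1968",
    "ANSIX341968": "ANSI_X3.4-1968",
}


def canonical_charset(name):
    """Return the canonical name for a charset spelling, or None if unknown."""
    key = re.sub(r"[^A-Z0-9]", "", name.upper())
    return CANONICAL.get(key)


variants = [canonical_charset(n) for n in ("BIG5", "BIG 5", "BIG-5", "big5")]
```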

I don't see a big point in responding to ASCII variants as 7 bit is always valid. Some generic way to indicate single byte 8 bit emulation would be useful, "ASCII" comes to mind as the pseudo-official best match.

I think it'd be best for tt++ to report whatever Linux terminal emulators use by default. I assume that's typically ISO-8859-1, given that the trend is to use either ISO-8859-1 or UTF-8. Does anyone know?

If Quix is right that the target audience makes the standard, I'd suggest using "ASCII" to indicate single byte operation, with the specific 8 bit table left unknown. So that'd make UTF-8 and ASCII for mud servers to query, and Asian servers could try BIG5 among others.
06 Apr, 2011, oenone wrote in the 13th comment:
Votes: 0
Scandum said:
Another issue is standardization. Are "UTF-8", "ISO 646", "ISO 8859-1", "Windows-1252", and "BIG-5" what other systems are using, or have they settled for other variants?


In Europe, you see "ISO 8859-15" ("Latin-9") used, too. It's actually the same as "ISO 8859-1" ("Latin-1"), with very few characters replaced, for example to add the Euro sign (€).
06 Apr, 2011, Scandum wrote in the 14th comment:
Votes: 0
oenone said:
In Europe, you see "ISO 8859-15" ("Latin-9") used, too. It's actually the same as "ISO 8859-1" ("Latin-1"), with very few characters replaced, for example to add the Euro sign (€).

My biggest issue with TinTin++ is detecting the actual character set. Linux systems all seem to be set to UTF-8 by default, and I'm not entirely sure how non UTF-8 terminal emulators handle 8 bit characters. So my solution is to only report the encoding: 1 byte (ASCII), 2 byte (BIG5), or multi-byte (UTF-8).
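
As an illustration of that idea (not how TinTin++ actually does it), a client could reduce whatever codeset the locale reports to one of the three classes. `encoding_class` is a hypothetical helper; `locale.getpreferredencoding` is the standard library call, and its result depends on the user's environment.

```python
import locale


def encoding_class(codeset):
    """Collapse a codeset name to "UTF-8", "BIG5", or "ASCII" (single byte)."""
    name = codeset.upper().replace("_", "").replace("-", "").replace(" ", "")
    if name == "UTF8":
        return "UTF-8"
    if name.startswith("BIG5"):
        return "BIG5"
    # Everything else is treated as single byte, 8 bit table unknown.
    return "ASCII"


# What this process would report, based on the current locale settings.
reported = encoding_class(locale.getpreferredencoding(False))
```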

From what I gathered (through tt++ bug reports) German Linux systems already use UTF-8 for 8 bit characters. So that'd indicate that ISO 8859 is a thing of the past.