19 Mar, 2009, elanthis wrote in the 1st comment:
Votes: 0
I am a firm believer that clients should do basic word wrapping, including maintaining indentation. Few clients bother because servers don't send unwrapped content, unfortunately, so I find myself needing to add wrapping into the server itself.

I'm having a little trouble coming up with the easiest way to get this done. My current code buffers up output until a space is found (or the word buffer fills), checks the width of the buffer against the remaining screen space (as reported by NAWS, or a default of 70), and either spits out the word or spits out a newline followed by the current indentation followed by the word.

It works but it's woefully incomplete. It doesn't deal with embedded colors at all (not a problem for my MUD, as I never ever change colors mid-word, but other MUDs might want the feature). It assumes one byte == one character == one character cell. It only breaks on spaces.

Fixing some of that I can do with a little work. I can have it decode UTF8 sequences if a Unicode flag is set. I can alter the algorithm to break after hyphens and before punctuation that doesn't end a word. The trick however is dealing with true Unicode text properties. Some Unicode codepoints represent zero-width characters. Some are double-width. The line-break rules of non-Latin alphabets are totally unknown by me.

The best I can find is ICU, but that is a freaking HUGE dependency to pull in. Does anyone know of any alternative, light-weight Unicode-capable libraries (or even just data tables) I can use for this?

Also, is anyone else interested in having a small library for dealing with word wrapping? (Would be usable in a client too.)
19 Mar, 2009, David Haley wrote in the 2nd comment:
Votes: 0
Don't know of any libraries, but it would be cool to have a library to do this. It would be nice to be able to pass in a function that defines how to count the number of visible characters (and default to one byte == one visible character); that way, you could keep the library non-MUD-specific.
19 Mar, 2009, Vassi wrote in the 3rd comment:
Votes: 0
I didn't realize most clients didn't wordwrap. I'd often wondered why NAWS was a big deal, since word-wrapping is a simple boolean feature on a rich text control.
19 Mar, 2009, elanthis wrote in the 4th comment:
Votes: 0
Many clients don't use a rich text control. Especially console-only clients.

I personally need more than just word-wrapping, I need the indentation support too. With controls, at that… if I indent the start of a sentence I don't want the whole sentence indented, but if I decide I want a block of text indented 4 spaces, I want the whole thing to be indented even across automatic line breaks. pita.

David, you think I should give a custom decoding function on top of a custom break-point detection function? Is there any value in supporting anything besides one-byte-only 8-bit encodings (ISO-8859-1, win-1252, etc.) and UTF-8?
20 Mar, 2009, David Haley wrote in the 5th comment:
Votes: 0
Well, to be internationally-conscious, you'd need to be able to support those charsets. I'm not sure, to be entirely honest, of what UTF-8 can/can't represent as opposed to full unicode – I haven't done much work there. I think it can't represent completely different charsets e.g. Chinese. If it's not too terrible, it would be nice to be able to support those. At least, being able to support common "Western" languages like Spanish, French, German, etc. would be quite nice. I'll give it a think and maybe say more tomorrow, after getting some sleep. :wink:
20 Mar, 2009, Les wrote in the 6th comment:
Votes: 0
David Haley said:
At least, being able to support common "Western" languages like Spanish, French, German, etc. would be quite nice. I'll give it a think and maybe say more tomorrow, after getting some sleep.


ISO-8859-1, -2, -3, etc. are all single-byte encodings, so if it works for one of them it will work for all. UTF-8 is able to represent any character in the Unicode character set, albeit using up to 4 bytes per character to do so.

So since UTF-8 is universal as far as Unicode is concerned, if you support only the single-byte encodings and UTF-8, you've covered all your bases.
20 Mar, 2009, elanthis wrote in the 7th comment:
Votes: 0
To repeat Les, UTF-8 is a variable-length multi-byte encoding for Unicode, and can support ANY Unicode codepoint. It is tricky because the number of bytes used for a character varies by how high the codepoint value is. ASCII values in UTF-8 are encoded as a single byte of equal value, while non-ASCII values never include any byte that is equal to any ASCII value, which was the motivation behind UTF-8: adding Unicode support to software that largely only dealt with ASCII/single-byte strings.

UTF-16 is probably what you're thinking of, David. It can represent the entire BMP ("basic multilingual plane", iirc) in a single 16-bit unit, but the higher codepoints need two units (a surrogate pair). While in practice almost nothing outside of the BMP is actually in use, it does mean that fully correct software has to deal with surrogate pairs, which is kind of a pain. UTF-32 just uses a full 32-bit value and can represent any Unicode codepoint with a single element, but for common text takes up 2x as much memory as UTF-16. Both UTF-16 and UTF-32 have byte-order issues too (hence the byte-order mark), which is a pain. UTF-8 represents common English and most Western European languages in less space than either UTF-16 or UTF-32, but many Asian and African languages use codepoints that require 3 bytes in UTF-8, which is 50% chunkier than UTF-16.
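
To put numbers on the size comparison, here is a minimal sketch of how many bytes a single codepoint takes in each encoding. The function names are made up for illustration and are not from any library.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to encode one Unicode codepoint in UTF-8. */
int utf8_encoded_len(uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) return 1;      /* ASCII */
    if (cp < 0x800) return 2;     /* e.g. most accented Latin letters */
    if (cp < 0x10000) return 3;   /* rest of the BMP */
    return 4;                     /* above the BMP */
}

/* Bytes in UTF-16: one 16-bit unit inside the BMP, a surrogate pair above it. */
int utf16_encoded_len(uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;
}

/* Bytes in UTF-32: always one 32-bit unit. */
int utf32_encoded_len(uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return 4;
}
```

For U+4E2D (a common CJK ideograph) this gives 3 bytes in UTF-8 against 2 in UTF-16, which is the "50% chunkier" case; for plain ASCII it is 1 against 2.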

UNIX and the Internet tend to prefer UTF-8 solely because it's compatible with existing text-oriented and byte-oriented protocols (e.g., HTTP, SMTP, TELNET, IMAP, etc.) and because old software can for the most part work without modification on UTF-8 text. UTF-16 is preferred by Windows and Java, which were "early adopters" of Unicode. UTF-32 is rarely used, although most Unicode-aware software does use a 32-bit int instead of char to represent individual characters for obvious reasons.

The only thing that software using UTF-8 needs to be aware of is the stuff I mentioned explicitly for word wrapping: counting _characters_ instead of counting _bytes_. If you're just parsing and printing text you don't need to be aware of UTF-8 at all; at most you might just want to be careful when truncating strings (e.g. with snprintf) so that you don't leave half of a UTF-8 extended character sequence at the end of the buffer, but even then it's not really any bigger of a deal than the lost characters you'd get from truncating any ASCII string, thanks to how UTF-8 was designed.
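
A minimal sketch of the careful truncation mentioned above, assuming the input is already valid UTF-8. `utf8_copy_truncated` is a hypothetical helper name, not a standard function. It relies on the fact that UTF-8 continuation bytes always look like 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst (capacity dst_size, including the NUL), cutting only at
 * a character boundary. Returns the number of bytes copied, excluding the NUL.
 * Assumes src is valid UTF-8. */
size_t utf8_copy_truncated(char *dst, size_t dst_size, const char *src) {
    assert(dst != NULL && src != NULL && dst_size > 0);
    size_t len = strlen(src);
    if (len >= dst_size) {
        len = dst_size - 1;
        /* If the first dropped byte is a continuation byte, the cut falls
         * inside a multi-byte sequence: back up to its lead byte so the
         * whole sequence is dropped. */
        while (len > 0 && ((unsigned char)src[len] & 0xC0) == 0x80)
            len--;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

With a 5-byte buffer and the 5-byte string "caf\xC3\xA9", a plain byte-wise cut would keep the lone lead byte 0xC3; this version keeps just "caf".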

Basically instead of doing something like:

/* iterate over each character in an ASCII/ISO-8859-* string */
while (buffer[index] != 0) {
    char c = buffer[index++];
    blah;
}


you need something more like:

/* get next Unicode character from a utf8 string */
int utf8_get_next(char **ptr) {
    if (**ptr & 0x80) {
        int decode;
        /* utf8 decoding logic here, which results in *ptr pointing to the position following
         * the last byte of the UTF8 character sequence, which automatically deals with
         * NUL terminators
         */
        return decode;
    } else {
        /* plain ASCII byte: return it and advance the caller's pointer
         * (note the parentheses: *ptr++ would advance ptr itself) */
        return *((*ptr)++);
    }
}

/* iterate over each character in a UTF-8 string */
char *ptr = buffer;
while (*ptr != 0) {
    int c = utf8_get_next(&ptr);
    blah;
}


And you can make code that deals with either like so:

char *ptr = buffer;
while (*ptr != 0) {
    int c = is_utf8 ? utf8_get_next(&ptr) : *(ptr++);
    blah;
}
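
For reference, one possible body for the decoding step left as a comment above. This is a sketch only: it takes a `const char **` rather than `char **`, returns U+FFFD for malformed input (my choice, not something the code above specifies), and does not reject overlong forms or surrogate codepoints. `utf8_count_chars` is an illustrative helper name.

```c
#include <assert.h>
#include <stddef.h>

#define UTF8_REPLACEMENT 0xFFFD

/* Decode the codepoint at *ptr and advance *ptr past it. */
int utf8_get_next(const char **ptr) {
    assert(ptr != NULL && *ptr != NULL);
    const unsigned char *s = (const unsigned char *)*ptr;
    int cp, extra;

    if (s[0] < 0x80)                { cp = s[0];        extra = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; extra = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; extra = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; extra = 3; }
    else {
        /* stray continuation byte or invalid lead byte: skip one byte */
        *ptr += 1;
        return UTF8_REPLACEMENT;
    }

    for (int i = 1; i <= extra; i++) {
        /* A NUL terminator is not a continuation byte, so a truncated
         * sequence stops here and never reads past the end of the string. */
        if ((s[i] & 0xC0) != 0x80) {
            *ptr += i;
            return UTF8_REPLACEMENT;
        }
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *ptr += extra + 1;
    return cp;
}

/* Count characters (codepoints), not bytes. */
size_t utf8_count_chars(const char *s) {
    size_t n = 0;
    while (*s != 0) {
        utf8_get_next(&s);
        n++;
    }
    return n;
}
```

Note that this counts codepoints, which is still not the same as counting screen cells; that is the next problem.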


That covers the character counting/iteration at least. The word-wrapping part is where it gets hellacious. Even for Western European text, you need to be able to identify:

* non-breaking spaces and non-breaking hyphens
* combining characters that have no width
* zero-width spaces
* regular line-break points (which is complicated enough on its own)
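
A rough sketch of what the first three items in that list look like as code. The codepoints are real Unicode assignments, but the tables are tiny hand-picked samples, nowhere near the full Unicode property data, and the function names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cells a codepoint occupies: 0 for the few zero-width cases listed here,
 * 1 for everything else (double-width characters are not handled). */
int cell_width(uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp >= 0x0300 && cp <= 0x036F) return 0;  /* Combining Diacritical Marks block */
    if (cp == 0x200B) return 0;                  /* ZERO WIDTH SPACE */
    if (cp == 0x200C || cp == 0x200D) return 0;  /* ZERO WIDTH NON-JOINER / JOINER */
    if (cp == 0x2060 || cp == 0xFEFF) return 0;  /* WORD JOINER, ZERO WIDTH NO-BREAK SPACE */
    return 1;
}

/* Spaces at which a line may be broken. */
bool is_break_space(uint32_t cp) {
    return cp == 0x20 || cp == 0x09 || cp == 0x200B;
}

/* Characters that look like a space or hyphen but forbid a break. */
bool is_no_break(uint32_t cp) {
    return cp == 0x00A0     /* NO-BREAK SPACE */
        || cp == 0x202F     /* NARROW NO-BREAK SPACE */
        || cp == 0x2011     /* NON-BREAKING HYPHEN */
        || cp == 0x2060;    /* WORD JOINER */
}
```

The fourth item, regular line-break points, is the part that really needs proper data tables.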

Then of course you have the question of the breaking algorithm. The simple algorithm can result in ugly line breaks in some cases. There is the method originally designed for TeX which has far better results, but requires that you process whole lines/paragraphs at a time. That means buffering up all your text until a hard newline and _then_ processing it. Not a big deal, but something to be taken into account. Plus dealing with indentations… also not a big deal, but something to take into account.
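
To show why the whole-paragraph approach needs all the text up front, here is a much-simplified sketch. It is NOT the TeX algorithm itself (no hyphenation, no stretch/shrink, no penalties), just a dynamic program that picks breaks to minimize the sum of squared trailing space, with the last line free. `wrap_min_raggedness` and `MAX_WORDS` are names invented for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_WORDS 256

/* word_len[i] is the visible width of word i. On return, line_start[k] is
 * the index of the first word on line k (room for n entries is required).
 * Returns the number of lines. */
size_t wrap_min_raggedness(const int *word_len, size_t n, int width,
                           size_t *line_start) {
    assert(n <= MAX_WORDS && width > 0);
    long best[MAX_WORDS + 1];    /* best[i]: minimum cost of laying out words i..n-1 */
    size_t next[MAX_WORDS + 1];  /* next[i]: first word of the line after the one starting at i */

    best[n] = 0;
    next[n] = n;
    for (size_t i = n; i-- > 0;) {
        best[i] = LONG_MAX;
        next[i] = i + 1;
        int used = -1;  /* cancels the space counted before the first word */
        for (size_t j = i; j < n; j++) {
            used += 1 + word_len[j];
            if (used > width && j > i)
                break;  /* words i..j do not fit on one line */
            /* an over-long single word still gets a line of its own */
            long slack = used > width ? 0 : width - used;
            long cost = (j + 1 == n) ? 0 : slack * slack;
            if (cost + best[j + 1] < best[i]) {
                best[i] = cost + best[j + 1];
                next[i] = j + 1;
            }
        }
    }

    size_t lines = 0;
    for (size_t i = 0; i < n; i = next[i])
        line_start[lines++] = i;
    return lines;
}
```

For word widths 3, 2, 2, 5 and a width of 6, the greedy approach gives "aaa bb / cc / ddddd", while this picks "aaa / bb cc / ddddd", which is less ragged. The catch is that nothing can be emitted until the paragraph is complete.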

The general algorithm for simple line breaking goes something like:

process_char(int c) {
    if (c == '\n') {
        flush_buffer();
        print_hard_nl();
        return;
    } else if (is_space(c)) {
        if (!soft_space) {
            flush_buffer();
            print_char(c);
        }
        return;
    } else if (is_break_point(c)) {
        flush_buffer();
        print_char(c);
        return; /* already printed, so don't buffer it as well */
    }

    /* deal with overflow */
    buffer[buffer_length++] = c;
}

flush_buffer() {
    if (soft_space)
        print_indent();
    soft_space = false;
    if (buffer_length + output_pos > width)
        print_soft_nl();
    print(buffer, buffer_length);
    buffer_length = 0;
}

print_soft_nl() {
    print_char('\n');
    soft_space = true;
    output_pos = 0;
}

print_hard_nl() {
    if (!soft_space)
        print_char('\n');
    soft_space = false;
    output_pos = 0;
}

/* rest should be self explanatory */
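
For anyone who wants something that compiles, here is a self-contained sketch of the same greedy idea. It is a variant, not the code above: instead of a soft_space flag it holds spaces back in a counter and drops them whenever a wrap happens, it writes into a fixed buffer rather than a socket, and it assumes one byte == one cell. All names (`struct wrapper`, `wrap_char`, etc.) are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define WORD_MAX 128
#define OUT_MAX 1024

struct wrapper {
    char word[WORD_MAX];   /* pending word, not yet emitted */
    size_t word_len;
    char out[OUT_MAX];     /* wrapped output, always NUL-terminated */
    size_t out_len;
    int width;             /* screen width, e.g. as reported by NAWS */
    int indent;            /* indentation for automatically wrapped lines */
    int col;               /* current output column */
    int pending_spaces;    /* spaces seen since the last emitted word */
    bool has_word;         /* a word was already emitted on this line */
};

static void emit(struct wrapper *w, char c) {
    if (w->out_len + 1 < OUT_MAX) {  /* silently drop output past the buffer */
        w->out[w->out_len++] = c;
        w->out[w->out_len] = '\0';
    }
}

static void flush_word(struct wrapper *w) {
    if (w->word_len == 0)
        return;
    if (w->has_word && w->col + w->pending_spaces + (int)w->word_len > w->width) {
        /* soft break: the held-back spaces are dropped, the indent is added */
        emit(w, '\n');
        for (int i = 0; i < w->indent; i++)
            emit(w, ' ');
        w->col = w->indent;
    } else {
        for (int i = 0; i < w->pending_spaces; i++)
            emit(w, ' ');
        w->col += w->pending_spaces;
    }
    for (size_t i = 0; i < w->word_len; i++)
        emit(w, w->word[i]);
    w->col += (int)w->word_len;
    w->word_len = 0;
    w->pending_spaces = 0;
    w->has_word = true;
}

void wrap_init(struct wrapper *w, int width, int indent) {
    assert(w != NULL && width > 0 && indent >= 0);
    memset(w, 0, sizeof *w);
    w->width = width;
    w->indent = indent;
}

void wrap_char(struct wrapper *w, char c) {
    if (c == '\n') {               /* hard newline resets the line state */
        flush_word(w);
        emit(w, '\n');
        w->col = 0;
        w->pending_spaces = 0;
        w->has_word = false;
    } else if (c == ' ') {
        flush_word(w);
        w->pending_spaces++;       /* held until we know where the next word goes */
    } else {
        if (w->word_len == WORD_MAX)
            flush_word(w);         /* word buffer full: forced break point */
        w->word[w->word_len++] = c;
        if (c == '-')
            flush_word(w);         /* allow a break after a hyphen */
    }
}

void wrap_text(struct wrapper *w, const char *s) {
    while (*s)
        wrap_char(w, *s++);
}

void wrap_finish(struct wrapper *w) {
    flush_word(w);
}
```

Spaces typed after a hard newline are still held and then emitted before the first word, so manual indentation survives; spaces around a soft break never reach the output.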


The soft space stuff is possibly the biggest trick. The idea is that if you take text like "bigword anotherword" and the right-edge of the screen aligns with the very end of "bigword", you don't want to end up printing something like:

bigword
 anotherword


So when the code inserts a line break it needs to ignore all following spaces until the next word or the next newline. All the next newline does is disable the soft space flag, because if you explicitly put spaces after a newline you obviously wanted the indentation.

Indentation is also a bit of a trick. It's not safe to assume that the manual indentation on the start of the line means you want wraps to have the same indentation. You may just want to indent the start of a paragraph, for example. Unicode has no way of embedding indentation markers (that I know of), and even if it did those wouldn't work with ASCII/ISO-8859-* text. So you need a separate interface for defining indentations.

On top of that, the above code can't deal with ANSI/VT100 escape sequences. Aside from needing to add logic to find and parse out the sequences (color codes add no width), you still have to parse the _meaning_ of the control codes. The clear screen code resets output_pos and soft space. Cursor control affects output_pos and needs parsing to find the new value. etc.
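
The "color codes add no width" half of that is the easy part. A minimal sketch, assuming one byte == one cell and only CSI-style sequences (ESC, '[', parameters, final byte); `visible_width` is a name invented here. It only skips the sequences and makes no attempt to interpret clear-screen or cursor movement.

```c
#include <assert.h>
#include <stddef.h>

/* Count the cells a string occupies, skipping ANSI CSI escape sequences. */
size_t visible_width(const char *s) {
    assert(s != NULL);
    size_t width = 0;
    while (*s) {
        if (s[0] == '\x1b' && s[1] == '[') {
            s += 2;
            /* parameter and intermediate bytes are in the range 0x20..0x3F */
            while (*s >= 0x20 && *s <= 0x3F)
                s++;
            /* a final byte in the range 0x40..0x7E ends the sequence */
            if (*s >= 0x40 && *s <= 0x7E)
                s++;
        } else {
            width++;
            s++;
        }
    }
    return width;
}
```

So "\x1b[1;31mred\x1b[0m" measures as 3 cells. Tracking what the sequences actually do to the cursor is the hard half.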

Source MUD deals with this specially. The code never sends \e[2J in any text stream, it just calls a clear_screen() method. That resets the word wrapping state entirely and then buffers the text. Color currently is in the text stream (which is gross and shall be fixed) which requires a parser on top of the word wrapping, which many other MUDs do anyway for their custom color code escape sequences. (Source MUD just uses \e as its custom marker instead of { or something because \e is stripped from player input and is never found in any data files.) That parser just flushes the wordwrap buffer on receipt of an escape sequence, which essentially means that any color changes are line-break points. Since I never, ever change color mid-word, that isn't a problem for my code. For MUDs designed by chimpanzees that spit out a different color for every letter of a word (or do stupid color striping or similar visually-jarring and distracting effects for poorly-thought-out reasons) that kind of code will cause mid-word breaks. Not sure I care (those MUDs are dumb), but worth noting.

This is all handled in detail and with far greater accuracy by any modern graphical text positioning toolkit, e.g. GTK+'s Pango. If you are writing a client and it isn't a console-only client, I do highly recommend using your toolkit's existing text layout facilities, which should work just fine with monospace text too. For people wanting to add word wrapping to servers or to console-only clients, your options currently are limited to a half-assed implementation of the above, a very time-consuming and still very limited implementation of the above, or the ICU library which is very powerful and correct (it's what many GUI toolkits use internally).

If I push out a libwordwrap, it'll be a limited implementation of the above, only suitable for Western European text. I may end up taking it further and adding the TeX line break algorithm, but I doubt it.

It may perhaps be worthwhile to add a way to negotiate whether word wrapping is in effect on a client, too. Word wrapping can add up to a good deal of memory and CPU usage and it's far better to distribute that workload out to all the clients instead of making the server do it for everyone. Something as simple as:

(client supports word wrapping)
server: IAC DO WORDWRAP
client: IAC WILL WORDWRAP

or

(client does not support word wrapping)
server: IAC DO WORDWRAP
client: IAC WONT WORDWRAP
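
On the wire that exchange is three bytes each way. A sketch of both sides follows; the IAC/DO/WILL/WONT values are the standard telnet command codes, but `TELOPT_WORDWRAP` is a placeholder value picked for illustration, since no such option is actually assigned, and the function names are invented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { TEL_IAC = 255, TEL_DO = 253, TEL_WONT = 252, TEL_WILL = 251 };

#define TELOPT_WORDWRAP 0xC8  /* placeholder: not an assigned telnet option */

/* Client side: answer a 3-byte "IAC DO <option>" request.
 * Returns false if req is not a DO request; otherwise fills in reply. */
bool answer_do(const unsigned char req[3], bool client_can_wrap,
               unsigned char reply[3]) {
    assert(req != NULL && reply != NULL);
    if (req[0] != TEL_IAC || req[1] != TEL_DO)
        return false;
    reply[0] = TEL_IAC;
    /* agree only to the word wrap option; refuse anything unknown */
    reply[1] = (req[2] == TELOPT_WORDWRAP && client_can_wrap) ? TEL_WILL : TEL_WONT;
    reply[2] = req[2];
    return true;
}

/* Server side: did the client agree to do its own wrapping? */
bool client_wraps(const unsigned char reply[3]) {
    assert(reply != NULL);
    return reply[0] == TEL_IAC && reply[1] == TEL_WILL
        && reply[2] == TELOPT_WORDWRAP;
}
```

A server would send unwrapped text only after `client_wraps()` returns true and keep wrapping server-side otherwise, which also covers clients that never answer.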
20 Mar, 2009, David Haley wrote in the 8th comment:
Votes: 0
Hmph, seems like a pretty complex issue. Given all that, it makes sense to go for a limited implementation. If people are writing MUDs in Chinese, they're probably in a completely different code-world (no pun intended…) from us in the first place. So it's dubious how much overlap there would be.