I originally thought using fonts would be pretty simple. However, proper handling of fonts has ended up being a significant effort in Windward Reports (our XML and SQL Reporting system). If you're going to do much more than place a line of text in a form, then the details start to matter.
Fonts & Glyphs
So what is a font? Fundamentally a font is a series of glyphs. What you think of as a character, like the letter A, is drawn as a glyph. A font is then a set of glyphs for all the letters in that font. If you get the Helvetica font, all its glyphs look one way. If you get the Times Roman font, they look another. Each is the set of glyphs from that font.
Now we need to introduce the concept of code pages. A code page is a mapping from a character number to a specific character, and through the font to a specific glyph. Programs originally stored each character as a byte. Then for Asian character sets there were the DBCS systems (some characters were 1 byte, some 2). Programs today mostly use Unicode, and web pages tend to be UTF-8, a Unicode encoding in which each character is a sequence of 1 to 4 bytes.
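As a small illustration of the 1 to 4 byte range, here is a Python sketch that encodes a few sample characters as UTF-8 and counts the bytes. The sample characters are my own choice, not something from a particular web page.

```python
# Sample characters chosen to need 1, 2, 3 and 4 bytes in UTF-8.
samples = [
    "A",           # U+0041, plain ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

# Map each character to the number of bytes in its UTF-8 encoding.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```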
Why bring up encoding? Because each font has an encoding, and character number 178 could return a very different glyph depending on the code page used by the font. Most font files use Unicode so you have a standard there, but many programs still use specific code pages, where that page is mapped to the font. This is what occurs when you display ABC and the font is Wingdings, so you get symbols instead of letters. So point one is you need to make sure that the encoding you use matches, or is mapped to, the encoding of the fonts you use.
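To make the "character number 178" point concrete, here is a minimal Python sketch that decodes that one byte under three code pages. The three code pages are just ones I picked because Python ships codecs for them; any others would show the same effect.

```python
import unicodedata

# One byte holding character number 178 (0xB2).
raw = bytes([178])

# Decode the same byte under three different code pages.
decoded = {cp: raw.decode(cp) for cp in ("cp1252", "cp1251", "cp437")}

for cp, ch in decoded.items():
    # Print the Unicode code point and official name each code page maps it to.
    print(cp, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The same stored byte comes out as a different character each time, so the glyph you see depends entirely on which mapping is applied before the font is consulted.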
And it gets even more complex. The characters with the values 0xE000 – 0xF8FF are reserved as the Unicode Private Use Area, so the standard assigns them no meaning. Each font can make those anything it wants (one use is to add the Klingon script). So a character with a value in this range is by definition tied to the font file used to display that character. This is how most symbol type fonts work.
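A quick way to spot text that is tied to a particular font is to look for characters in that range. The sketch below is one possible check in Python; the function names are mine, for illustration only.

```python
import unicodedata

def is_private_use(ch: str) -> bool:
    """True if ch falls in the 0xE000 - 0xF8FF private use range."""
    return 0xE000 <= ord(ch) <= 0xF8FF

def font_dependent_chars(text: str) -> list:
    """Return the characters in text whose meaning depends on the font."""
    return [ch for ch in text if is_private_use(ch)]

sample = "ABC\ue000\uf8ff"
flagged = font_dependent_chars(sample)
for ch in flagged:
    # Unicode's general category for private use characters is "Co".
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))
```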
Ok, so you are using Unicode, your font file uses Unicode, you pass it a string and… the string displays blank. What's going on? Well, there's no requirement that a font file have a glyph for any given character. A Symbol font won't have ABC. Most fonts used in Europe and America don't have the Chinese, Japanese, or Korean glyphs. It's not an error to ask for a character that a font has no glyph for, but depending on the renderer it can display nothing: not a blank space, but nothing (i.e. 0 points wide).
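One defensive habit is to check a string against the set of characters a font covers before rendering. The Python sketch below assumes you already have that coverage as a set of code points; in practice it would come from the font file's character map, and the `latin_only` set here is a made-up stand-in, not a real font.

```python
def uncovered(text: str, coverage: set) -> list:
    """Return each distinct character in text that has no glyph in coverage."""
    missing = []
    for ch in text:
        if ord(ch) not in coverage and ch not in missing:
            missing.append(ch)
    return missing

# Hypothetical coverage: a font that only has printable ASCII glyphs.
latin_only = set(range(0x20, 0x7F))

# "Font" followed by two Han characters the hypothetical font lacks.
missing = uncovered("Font \u5b57\u4f53", latin_only)
print([f"U+{ord(ch):04X}" for ch in missing])
```

If the list is not empty, you know up front that you need a fallback font for those characters rather than discovering a gap in the output.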
You can hit a similar problem with one of the old code pages when you want to display a character that does not exist in that code page. In that case you need to map in a different code page, at least for that character (this is how Word used to handle this case).
Fonts fall into several different classes. First there are proportional vs. monospaced fonts. In a monospaced font all characters are the exact same width. And the height is consistent, in that all lower case letters are the same height, as are all upper case. Avoid monospaced fonts as much as possible because they are much harder to read. Asian fonts are almost all monospaced because the Chinese Han characters all have identical widths and heights, so proportional would make no sense. On the flip side, Hebrew and Arabic pretty much have to be proportional.
Next is the typeface, which can be serif, where you get small finishing strokes at the ends of the main strokes; sans serif, where you do not get anything extra at the ends; decorative, where it is way beyond normal; and Symbol, which can have anything at all, including barcodes that match the ASCII numbers of the character codes mapped to the glyphs. And this is just the Western European alphabets.
Now we get into measuring fonts, and for fonts most (not all) of that is measuring glyphs. The standard measurement used for fonts is the point, and while there's a lot of history to what a point originally meant, in the computer world it is 72 points == 1 inch. You will also sometimes see the twip, which stands for twentieth of a point, so 1440 twips == 1 inch. And we now have the EMU, where 914400 EMUs == 1 inch. If you work with points, you need to use floating point variables. Twips are generally ok as an integer and EMUs definitely are.
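The three units convert with simple ratios: 1 point is 20 twips and 12700 EMUs. Here is a small Python sketch of the conversions; the function names are mine, and rounding to the nearest whole twip or EMU is an assumption about what you want when a value does not land exactly.

```python
POINTS_PER_INCH = 72
TWIPS_PER_INCH = 1440
EMUS_PER_INCH = 914400

def points_to_twips(points: float) -> int:
    """Convert points to whole twips (20 twips per point)."""
    return round(points * TWIPS_PER_INCH / POINTS_PER_INCH)

def points_to_emus(points: float) -> int:
    """Convert points to whole EMUs (12700 EMUs per point)."""
    return round(points * EMUS_PER_INCH / POINTS_PER_INCH)

def emus_to_points(emus: int) -> float:
    """Convert EMUs back to points; the result may be fractional."""
    return emus * POINTS_PER_INCH / EMUS_PER_INCH

print(points_to_twips(10.5))   # a 10.5 point measure in twips
print(points_to_emus(10.5))    # the same measure in EMUs
print(emus_to_points(12700))   # one point's worth of EMUs
```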
Then comes the font point size. This is a completely arbitrary number. Think of it like the diagonal size of the old CRT monitors where the actual size was close to what you expected, but was never that number. The point size determines the size of the rendered glyphs, but it has no specific measurement on the page.
Now here's where it starts to get interesting, the fontmetrics. First, everything must be measured from the baseline. Working from any other part of the font won't work – you will hit major problems. So start there. The highest drawn part above the baseline is the ascent and the lowest drawn part below the baseline is the descent, both measured from the baseline.
Then there is the spacing between two lines of text. This is a font setting, as the font designer determines the appropriate spacing for that font. It can be returned in different ways: Windows considers this the spacing you put above the next line and returns a measure from baseline to baseline, while Java views it as the spacing below a line before the next line and returns just this value. This leading is the spacing you place between lines of similar single spaced text. If the spacing is greater than single spacing, then you add to this value.
You generally want to get these heights for the fonts, not for the particular glyphs in the string you display. Why? Because what if a line is "we were wrox"? With no ascenders or descenders the line would be placed closer to the other lines in the paragraph, and that would look weird. You also need to look at all fonts and point sizes on a line, because if some text is larger you must use the largest ascent/descent/leading values. But only for the line(s) that have the larger text, not for the entire paragraph. And again, all of this is measured from the baseline, which is the only way to handle mixed fonts/sizes.
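Here is a rough Python sketch of that per-line calculation. It assumes the Java-style convention where leading is only the extra gap below a line, and the metric values are made up for illustration; `RunMetrics`, `line_extents` and `baselines` are hypothetical names, not a platform API.

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    """Font-level metrics for one run of text, in points (made-up values)."""
    ascent: float
    descent: float
    leading: float  # extra gap below the line only (Java-style assumption)

def line_extents(runs):
    """Largest ascent, descent and leading among the runs on one line."""
    return (
        max(r.ascent for r in runs),
        max(r.descent for r in runs),
        max(r.leading for r in runs),
    )

def baselines(lines, top=0.0):
    """Return the baseline y position of each line, with y growing downward."""
    y = top
    result = []
    for runs in lines:
        ascent, descent, leading = line_extents(runs)
        y += ascent              # drop from the top of the line to its baseline
        result.append(y)
        y += descent + leading   # move to the top of the next line
    return result

body = RunMetrics(ascent=9.0, descent=3.0, leading=1.0)
big = RunMetrics(ascent=18.0, descent=6.0, leading=2.0)

# Only the middle line contains the larger text.
positions = baselines([[body], [body, big], [body]])
print(positions)
```

Only the middle line gets the extra room. The first and last lines keep the body font's spacing, and every run on a line shares that line's single baseline.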
Ok, height takes a bit of work but it's pretty straightforward, but the width – this gets really interesting. And by interesting I mean you have to get everything just right. Fundamentally, except for fixed width fonts, adding up the width of each glyph will not equal the width of all those glyphs rendered together. Pretty much never. Why? A couple of reasons:
- Kerning is where letters are placed based on the letters they adjoin. That is why AB stays distinct while tt overlaps quite a bit.
- Some character combinations in Latin alphabets are combined such as ae becoming æ and in German ss becoming ß.
- Hebrew and Arabic glyphs are different for the same character depending on whether it is at the start, middle, or end of a word. And in the case of Arabic especially, the glyphs used on the ends tend to be wider than the glyphs in the middle. So the width of ﺺ is dependent on where it is in the string.
- Bi-directional text has an additional issue listed below.
- Complex scripts, like Indic (India) will change the glyph at a location building it up from several characters. So a three character string can be anything from 1 to 3 glyphs wide.
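The positional forms in the Arabic bullet are visible in Unicode itself, which carries legacy "presentation form" code points for each shape. The Python sketch below lists the four forms of the letter sad and shows that compatibility normalization (NFKC) folds them all back to the one character you would store in text. This only illustrates the one-character-to-many-glyphs idea; a real renderer selects the shape through the font's shaping tables, not through these code points.

```python
import unicodedata

SAD = "\u0635"  # ARABIC LETTER SAD, the character as normally stored in text

# Presentation-form code points: isolated, final, initial, medial.
forms = ["\ufeb9", "\ufeba", "\ufebb", "\ufebc"]

for form in forms:
    print(f"U+{ord(form):04X}", unicodedata.name(form))

# NFKC normalization maps every positional form back to the base letter.
base_letters = {unicodedata.normalize("NFKC", form) for form in forms}
print(base_letters == {SAD})
```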
Very simply, you need to feed a complete, fully formatted string to the fontmetrics API provided by the platform you are running on to get the length of the string. It's an expensive call because the string will be rendered to memory to determine the length, but there is no alternative that will be accurate. And you must use the exact same settings when measuring as you do when rendering. Anytime these have not matched, we have found differences large enough that the human eye can make them out. The best way to test your code for this is to look at right aligned text, because you generally have to give the baseline position of the left end of the string when rendering, so if you calculate the length wrong it will show.
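To see why summing glyph widths goes wrong, and why right aligned text exposes it, here is a toy Python model. Everything in it is hypothetical: the advance widths and kerning pairs are invented numbers in twips, and `measured_width` only stands in for the platform call that measures the whole string.

```python
# Invented advance widths and kerning adjustments, in twips.
ADVANCE = {"A": 144, "V": 140, "o": 112}
KERNING = {("A", "V"): -16, ("V", "o"): -10}

def naive_width(text: str) -> int:
    """Sum of individual glyph widths, ignoring how neighbours interact."""
    return sum(ADVANCE[ch] for ch in text)

def measured_width(text: str) -> int:
    """Stand-in for a platform call that measures the complete string."""
    pairs = zip(text, text[1:])
    return naive_width(text) + sum(KERNING.get(pair, 0) for pair in pairs)

def right_aligned_x(text: str, right_edge: int, measure) -> int:
    """Left x position that makes text end exactly at right_edge."""
    return right_edge - measure(text)

RIGHT_EDGE = 4320  # 3 inches in twips
wrong_x = right_aligned_x("AVo", RIGHT_EDGE, naive_width)
right_x = right_aligned_x("AVo", RIGHT_EDGE, measured_width)
print(wrong_x, right_x, right_x - wrong_x)
```

With the naive sum, the string starts too far left and ends short of the margin by the kerning total, which is exactly the kind of ragged right edge you can spot by eye.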
Finally we have the issue of bi-directional text (Arabic & Hebrew). Bi-directional text goes right to left, except numbers and Latin words go left to right. So it is read right to left, then on a number or sequence of Latin text you jump over to the left most point, read left to right back to where you completed the previous Hebrew/Arabic, then jump to the start of the Latin/number part and go back to right to left.
There has been a ton of research performed on when these switches should take place. There are characters that have a strong direction, characters that have a weak direction, and characters that have no directional preference. You have no prayer of correctly implementing these rules. None. But all is not lost. Pretty much every platform, including Java and Windows, has an API where you provide the string of characters in the order read, and it will render them correctly according to the rules. They also have an API for telling you where each character is located and which character you should move to if you want to move the caret 1 character forward or backwards.
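You can at least inspect the directional class the Unicode standard assigns to each character. The Python sketch below groups the standard's bidirectional classes into strong, weak, and everything else. The grouping follows the Unicode bidirectional algorithm's categories as I understand them, the function name is mine, and this is only a lookup, nowhere near an implementation of the reordering rules.

```python
import unicodedata

STRONG = {"L", "R", "AL"}                           # left-to-right, right-to-left, Arabic letter
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}  # digits, separators, marks

def direction_strength(ch: str) -> str:
    """Classify one character as 'strong', 'weak' or 'neutral'."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    # Everything else: neutrals such as spaces and most punctuation,
    # plus explicit formatting codes, which this sketch does not separate.
    return "neutral"

# Latin letter, Hebrew alef, Arabic beh, digit, space, exclamation mark.
for ch in ["A", "\u05d0", "\u0628", "1", " ", "!"]:
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), direction_strength(ch))
```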
You can use this API for all font rendering and caret movement regardless of text and it will work fine – on complex scripts too. It's a bit of a pain to start with this if you are not targeting bi-di or complex scripts, but if you're going to be there eventually it's best to start off using it so you don't have to re-architect your code. Trust me, you really really don't want to have to rearchitect (I had to once – OW!).
Do not copy Windows fonts to Linux or other operating systems. The fontmetrics tend to be off and the text will look off. I don't know why, as TrueType is supposed to be portable, but in practice, just like Java is write once debug everywhere, fonts tend to be design once tweak everywhere. Get fonts from a vendor who has optimized them for your platform.