Model File Character Encoding

Revision as of 21:06, 1 June 2016 by Bbecane (Talk | contribs)

Analytica 4.5 can read model files that use either the ASCII (aka ANSI, ISO-8859-1) or the UTF-8 encoding. Analytica 4.4 and earlier recognize only ASCII. This page explains these encodings, why they matter, and how to convert between them.

ASCII Encoding

When a file is stored with an ASCII encoding, one byte is used to store each character. This means that only 255 distinct characters can appear, namely Chr(1)..Chr(255). Analytica's ASCII files use a slightly modified version of the ISO-8859-1 code page standard. Due to an oddity dating back to the early days of the Macintosh (where Analytica originated), Analytica's mapping differs from the ISO-8859-1 standard on three extended characters: Chr(173), Chr(178) and Chr(179). In ISO-8859-1 these are the soft hyphen (normally invisible), ² and ³, but in Analytica's ASCII format they are ≠, ≤ and ≥.
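As a rough illustration of this mapping, the following Python sketch turns the bytes of an ASCII model file into Unicode text. The names `ANALYTICA_OVERRIDES` and `decode_analytica_ascii` are hypothetical and exist only for this example; this is not Analytica's own code. It treats every other byte as its ISO-8859-1 code point, which is a simplification.

```python
# Hypothetical helper, not part of Analytica: decode bytes from an
# Analytica "ASCII" model file into Unicode text.

# The three codes where Analytica's format departs from ISO-8859-1.
ANALYTICA_OVERRIDES = {
    173: "\u2260",  # ≠ (ISO-8859-1: soft hyphen)
    178: "\u2264",  # ≤ (ISO-8859-1: superscript two)
    179: "\u2265",  # ≥ (ISO-8859-1: superscript three)
}

def decode_analytica_ascii(data: bytes) -> str:
    """Map each byte to one character, applying the three overrides."""
    # chr(b) for b in 0..255 is exactly the ISO-8859-1 interpretation.
    return "".join(ANALYTICA_OVERRIDES.get(b, chr(b)) for b in data)

print(decode_analytica_ascii(bytes([65, 32, 178, 32, 66])))  # A ≤ B
```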

UTF-8 Encoding

In the UTF-8 encoding as described here, a single character may take up 1, 2 or 3 bytes. Each of these byte sequences maps to a Unicode character having a character code between 1 and 65535. (UTF-8 in general also defines four-byte sequences for codes above 65535; this page covers only the range 1 to 65535.) So whereas the ASCII encoding can only represent 255 distinct characters, this range of UTF-8 can represent 65535 characters. This large Unicode character set includes characters from East Asian languages (e.g., Chinese, Japanese, Korean), Arabic, Cyrillic (Russian), Greek, Scandinavian languages, etc. It also contains numerous mathematical and scientific symbols, currency symbols, decorative graphical symbols, and combining characters (accents and other marks that can be combined with any other character to form compound glyphs).

The most common characters used in English have character codes between 1 and 127. These use a single byte in UTF-8, and that byte is the same as in the ASCII mapping. Therefore, when a file contains only these characters, the sequence of bytes that makes up the file is the same whether it is saved in ASCII or UTF-8.

The characters from Chr(128) to Chr(2047) are represented by two bytes, and Chr(2048) to Chr(65535) use three bytes. If you load a UTF-8 file into a program that interprets it as ASCII, each of these characters shows up as two or three printable but unusual characters. For example, the character "£" appears as "Â£".
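The byte counts and the distortion described above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `misread_as_latin1` is made up for this example, and Python's `latin-1` codec stands in for a program that reads one character per byte.

```python
# Number of bytes UTF-8 uses for one sample character from each range.
for ch in ["A", "\u00a3", "\u20ac"]:  # A, £, €
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")

# Reading UTF-8 bytes with a one-byte-per-character encoding turns
# each multi-byte character into several characters.
def misread_as_latin1(text: str) -> str:
    return text.encode("utf-8").decode("latin-1")

print(misread_as_latin1("\u00a3"))  # Â£
```

The three sample characters take 1, 2 and 3 bytes respectively, matching the ranges given above.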

Character Code Differences

UTF-8 maps byte sequences to Unicode characters, and Unicode has its own standard for mapping numeric codes to characters. The standard agrees with ISO-8859-1 on all characters in the ranges Chr(1)..Chr(127) and Chr(161)..Chr(255), but has a different set of characters in the range Chr(128)..Chr(160). In ASCII these are the characters "€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ" (the assignments of the Windows-1252 "ANSI" code page), and in Unicode they are mostly non-printing control characters. Analytica 4.5 now uses the Unicode standard for Chr(173), Chr(178) and Chr(179), so these three also differ between ASCII and UTF-8 model file encodings.
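The difference in this range can be seen by comparing one byte value under both interpretations. The sketch below assumes that Python's `cp1252` codec is a fair stand-in for the ASCII (ANSI) reading of a model file; that equivalence is an assumption of this example, not something Analytica documents here.

```python
import unicodedata

# Byte 0x80 (code 128) read with the Windows-1252 ("ANSI") code page,
# versus the Unicode character that has code 128.
as_ansi = bytes([0x80]).decode("cp1252")
as_unicode = chr(0x80)

print(as_ansi, hex(ord(as_ansi)))        # € 0x20ac
print(unicodedata.category(as_unicode))  # Cc (a control character)

# Outside the range the two agree, e.g. code 233 is "é" in both.
assert bytes([233]).decode("cp1252") == chr(233)
```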

Identifying Encoding

To mark a file as UTF-8, Analytica 4.5 records the encoding on the first line of the model file, for example:

{ From user Lonnie, Model Bond_finance_model at 15-Mar-2013 1:09:29 AM, encoding="UTF-8" }

A byte-order mark (BOM) is optional and is another way to mark a file as UTF-8, but Analytica 4.4 doesn't recognize BOMs and will report errors about unrecognized characters if you try to load a file that has one into 4.4.
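A script that needs to tell the two kinds of file apart could look for the same two markers. The following Python sketch is one way to do that; the function name `detect_model_encoding` is hypothetical, and treating every unmarked file as ASCII is an assumption of this example rather than a documented rule of Analytica's loader.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"

def detect_model_encoding(data: bytes) -> str:
    """Return "UTF-8" or "ASCII" for the raw bytes of a model file."""
    # A leading byte-order mark identifies the file as UTF-8.
    if data.startswith(UTF8_BOM):
        return "UTF-8"
    # Otherwise look for encoding="..." on the first line only.
    first_line = data.split(b"\n", 1)[0]
    match = re.search(rb'encoding="([^"]+)"', first_line)
    if match and match.group(1).upper() == b"UTF-8":
        return "UTF-8"
    # Assumption: a file with neither marker is an ASCII file.
    return "ASCII"

header = (b"{ From user Lonnie, Model Bond_finance_model at "
          b'15-Mar-2013 1:09:29 AM, encoding="UTF-8" }\n')
print(detect_model_encoding(header))  # UTF-8
```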

Loading Analytica 4.4 models into Analytica 4.5

Analytica 4.4 models are always saved in ASCII format, and Analytica 4.5 reads ASCII encodings without difficulty, so there are no character encoding issues when loading Analytica 4.4 models into 4.5. Where a character code means something different in the two encodings, Analytica 4.5 automatically remaps it to the equivalent Unicode character, preserving the characters in the original model. In other words, Analytica 4.5's handling of character encodings is fully backward compatible.

Loading Analytica 4.5 models into Analytica 4.4

If you save a model from Analytica 4.5 and then load it into Analytica 4.4, character encoding issues can cause problems in some cases. However, these encoding problems only occur if you use extended Unicode characters (those characters that have no equivalent in the ASCII character set) in your 4.5 model.

When Analytica 4.5 saves your model, it determines whether any non-ASCII characters are present. If there are any (even just one), then it saves the file in UTF-8 format. If only ASCII characters are present, then it saves the file in the same ASCII encoding used by earlier releases of Analytica.

If you attempt to load a model saved in UTF-8 by Analytica 4.5 into Analytica 4.4, you'll be presented with a warning telling you that there may be issues with extended characters. Because a UTF-8 sequence is a printable sequence in ASCII, the net impact on your model might be small. However, some models contain "«null»" in certain places, and these become corrupted to "Â«nullÂ»", which can cause errors to appear.

Encoding conversion

If you need to convert a UTF-8-encoded model file into ASCII, read this section. This need might occur if you've used some Unicode characters in a model, but now need to go back to using that model in Analytica 4.4.

Warning: The conversion from UTF-8 to ASCII loses information. Since extended Unicode characters don't exist in ASCII, those extended characters are replaced by the '?' character.

To perform the conversion, you first need to install Notepad++, a free text editor (not associated with Lumina Decision Systems). The steps are:

  1. Load the model file into Notepad++
  2. On the Encoding menu, select Encode in UTF-8 without BOM. Skip this step if UTF-8 is already selected.
  3. On the Encoding menu, select Encode in ANSI.
  4. On the first line of the file, change encoding="UTF-8" to encoding="ascii"
  5. Go to the end of the file and look for a warning message after the final Close statement, which might look like
    {!-40499|MsgBox("This file is encoded in UTF-8 format and contains non-ascii characters. To properly read UTF-8 format, you need to upgrade to Analytica 4.5 or later. If your continue to use this model in this release of Analytica, unicode and extended ascii characters will become distorted and may cause parts of the model to break.",16,"Analytica 4.5 required")}
    If you see this, remove it.
  6. Select File → Save (or Save As...)
  7. Exit Notepad++

One final note: you need to do this conversion on the UTF-8 file saved from Analytica 4.5, not on a copy saved from Analytica 4.4.
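For anyone who would rather script the conversion, the Notepad++ steps above can be approximated in Python. This is a minimal sketch under several assumptions, not a tested replacement for the manual procedure: the function name `utf8_model_to_ascii` is hypothetical, Python's `cp1252` codec is assumed to match what "Encode in ANSI" produces on a Western-language Windows system, and the warning message is assumed to sit on a single line that begins with `{!` as in the example above. Like the manual steps, it replaces characters that have no one-byte form with '?' and does nothing special for the three codes that Analytica maps differently.

```python
def utf8_model_to_ascii(data: bytes) -> bytes:
    """Convert the raw bytes of a UTF-8 model file to a one-byte encoding."""
    # Decode as UTF-8; the "utf-8-sig" codec also drops a BOM if present.
    text = data.decode("utf-8-sig")
    lines = text.split("\n")

    # Step 4: relabel the encoding on the first line of the file.
    lines[0] = lines[0].replace('encoding="UTF-8"', 'encoding="ascii"')

    # Step 5: drop the UTF-8 warning message, if there is one.
    # Assumption: the message occupies one line, formatted as shown above.
    marker = "This file is encoded in UTF-8 format"
    lines = [ln for ln in lines
             if not (ln.lstrip().startswith("{!") and marker in ln)]

    # Step 3: characters with no one-byte equivalent become "?".
    return "\n".join(lines).encode("cp1252", errors="replace")

# Possible usage, with hypothetical file names:
#   with open("model_utf8.ana", "rb") as f:
#       converted = utf8_model_to_ascii(f.read())
#   with open("model_ascii.ana", "wb") as f:
#       f.write(converted)
```

Writing the result to a new file, as in the commented usage, keeps the original UTF-8 file intact in case the lossy conversion discards characters you still need.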
