Israeli Standard SI 4281 - English Translation

Av 5758 - August 1998

Information Technology: Implementation of Hebrew in the Hypertext Markup Language (HTML)

Translated and annotated by Jony Rosenne (rosennej@qsm.co.il), December 2000.

Translator Notes:

Introduction

This standard defines the implementation of Hebrew in the Hypertext Markup Language (HTML). It is based on the international standards and specifications relating to the Internet and to the internationalization of the World Wide Web.

The most important document in this respect is RFC 2070, Internationalization of the Hypertext Markup Language (1997). [Note: RFC 2070 has since been incorporated into HTML 4.]

Other standards associated with HTML, such as Java and Javascript, shall implement Hebrew in a manner compatible with this standard.

1. Scope

This standard defines a uniform implementation of the Hebrew language in HTML, the encoding of Hebrew information in accordance with the relevant international standards and specifications, and the manner of presenting this information.

2. References

Israeli Standards

SI 1311 (1989) Information Processing: ISO 8 bit coded character set for information interchange. [Note: This standard is equivalent to ISO 8859-8.]
SI 1311.1 (1996) Information Technology: ISO 8 bit coded character set with Hebrew points
SI 1311.2 (1996) Information Technology: ISO 8 bit coded character set with Hebrew accents
SI 1680 (1996) Open Systems Interconnection - Hebrew Implementation in Standard Generalized Markup Language (SGML)

International Standards

ISO 8859-1: 1987 Information Processing - 8-bit Single-Byte Coded Graphic Character Sets - Part 1: Latin Alphabet No. 1
ISO 8859-8: 1988 Information Processing - 8-bit Single-Byte Coded Graphic Character Sets - Part 8: Latin/Hebrew Alphabet
ISO 8879: 1986 Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML)
ISO/IEC 10646-1: 1993 Information Technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane
ISO/IEC 10646-1: 1993, Amendment 2 (1996) UCS Transformation Format 8 (UTF-8)

International Documents

IETF RFC 1738 Uniform Resource Locators (URL) (1994)
IETF RFC 1766 Tags for the Identification of Languages (1995)
IETF RFC 2070 Internationalization of the Hypertext Markup Language (1997). [Note: This is now incorporated into the HTML 4 specification]
Unicode 2.0 The Unicode Standard, Version 2.0 (1996)

3. Definitions

3.7 Directional Formatting Codes

Seven characters that guide the bidirectional implicit algorithm in special cases when the analysis of the properties of text characters is not sufficient to determine the character directionality. These codes are listed in Appendix A, Table A-5. They are defined in ISO-10646-1, Annex D. [Note: See this link for the directional formatting codes].

4. Reference Processing Model

5. Character Set

5.1 The Document Character Set

The document character set is UCS-4 according to ISO 10646-1.

The values 32 to 95 are allocated to the parallel values of ISO-8859-1 that are identical also to US-ASCII and to SI 1311.

The UCS includes the Hebrew letters, the Hebrew points in SI 1311.1 and the cantillation marks in SI 1311.2 (see Appendix A for the details). [Note: See these links for the letters, points and cantillation marks].

The UCS works according to the implicit directionality. [Note: This means the Unicode bidirectional algorithm. For an up-to-date specification, see Unicode Standard Annex #9, The Bidirectional Algorithm]

Notes:

1. The UCS is not single-valued: it contains a large number of precomposed characters. A precomposed character is equivalent to the sequence of characters of which it is composed. For example, the UCS contains the Hebrew letters with Dagesh as precomposed characters. Each such character is equivalent to the letter itself followed by Dagesh.

2. It is recommended not to use the Hebrew precomposed characters for encoding information.
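
As an illustration of note 1, the following Python sketch uses the standard unicodedata module to show that a precomposed Hebrew character is canonically equivalent to the base letter followed by Dagesh. The choice of U+FB31 (HEBREW LETTER BET WITH DAGESH) and the helper name decompose are assumptions made for the example; they do not appear in the standard.

```python
import unicodedata

PRECOMPOSED_BET = "\uFB31"        # HEBREW LETTER BET WITH DAGESH (precomposed)
BET, DAGESH = "\u05D1", "\u05BC"  # base letter and the Dagesh point

def decompose(text: str) -> str:
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)

decomposed = decompose(PRECOMPOSED_BET)

# Show the code points that make up the decomposed form.
print([f"U+{ord(ch):04X}" for ch in decomposed])
```

Storing the decomposed sequence rather than the precomposed character is consistent with the recommendation in note 2.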

5.2 External Encoding of Characters

Following are the alternatives for external encoding of the Hebrew characters for communication between systems. The encoding shall be indicated in the charset parameter of HTTP and also in the document under the META tag. [Note: See Specifying the character encoding.]

  1. Encoded according to SI 1311 with implicit directionality. The value of the parameter shall be: charset=ISO-8859-8-I
  2. Encoded according to UTF-8. The value of the parameter shall be: charset=UTF-8
  3. Encoded as 16 bit codes. The value of the parameter shall be: charset=ISO-10646-UCS-2
  4. Encoded according to UTF-16. The value of the parameter shall be: charset=UTF-16

SI 1311 is the preferred encoding for Hebrew documents. UTF-8, UCS-2 and UTF-16 are recommended for multilingual systems. Any other standard encoding may be used, in which case the Hebrew characters will be represented by numeric character references based on UCS.
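
The alternatives above can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample word, the function name external_forms and the choice of ISO-8859-1 as the "other standard encoding" are hypothetical, and big-endian byte order is assumed for UCS-2. Python's iso8859_8 codec produces the same bytes for implicit and visual text; the charset label alone signals the directionality.

```python
HEBREW = "\u05E9\u05DC\u05D5\u05DD"  # the word "shalom", stored in logical order

def external_forms(text: str) -> dict:
    """Encode text under each charset label (illustrative only)."""
    return {
        # SI 1311 / ISO 8859-8 bytes; the "-I" label marks implicit directionality.
        "ISO-8859-8-I": text.encode("iso8859_8"),
        "UTF-8": text.encode("utf-8"),
        # Assumption: big-endian byte order; valid only for BMP characters.
        "ISO-10646-UCS-2": text.encode("utf-16-be"),
        # Python prepends a byte order mark for the plain "utf-16" codec.
        "UTF-16": text.encode("utf-16"),
        # Any other encoding: Hebrew falls back to numeric character references.
        "ISO-8859-1": text.encode("latin-1", errors="xmlcharrefreplace"),
    }

forms = external_forms(HEBREW)
for label, data in forms.items():
    print(label, data)
```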

5.2.1 Determination of the External Encoding

The external encoding is specified by the charset parameter of the HTTP header. If it is absent, the META tag parameter should be used. If it too is absent, and if the document was received as a result of following a hyperlink, the charset parameter of the A tag will be used, if it exists.

If none of these apply, the user agent may determine an encoding according to its considerations.
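
The order of precedence described in this section can be sketched as a small Python function. The function name, the parameter names and the default fallback value are assumptions made for the example; the standard leaves the final choice to the user agent.

```python
from typing import Optional

def determine_encoding(http_charset: Optional[str] = None,
                       meta_charset: Optional[str] = None,
                       link_charset: Optional[str] = None,
                       followed_hyperlink: bool = False,
                       fallback: str = "ISO-8859-8-I") -> str:
    """Pick the external encoding in the order given in 5.2.1."""
    if http_charset:                          # 1. charset parameter of the HTTP header
        return http_charset
    if meta_charset:                          # 2. charset given in the META tag
        return meta_charset
    if followed_hyperlink and link_charset:   # 3. charset attribute of the A tag
        return link_charset
    return fallback                           # 4. user agent's own choice (assumed default)

print(determine_encoding(meta_charset="UTF-8"))
```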

It is recommended that the implementation will allow the user to manually select an encoding that suits him. It is desirable that this selection will apply to the displayed document and all its components, and that when the next document is displayed or the display is refreshed, the encoding will be determined once again according to the above.

5.4 Rendering

This standard does not specify how to render characters that are impossible to display, such as Chinese or Japanese characters when there is no suitable font.

5.4.1 Rendering Hebrew Characters

Software applications shall render all the characters in SI 1311, and shall select one of the following alternatives to render the points and cantillation marks of SI 1311.1 and 1311.2:

  1. Render them correctly. If the system can render only some of them, only those shall be rendered. The correct rendering of the points and cantillation marks is a complex function of the base letter and the collection of points and cantillation marks that join it, and is outside the scope of this standard.
  2. Do not render them. No indication should be given to the user that points and cantillation marks were not rendered, unless he explicitly requests it.

Note: This is a special case of handling undisplayable characters according to HTML 4.

This standard does not require the rendering of additional characters of the UCS, not even those characters that are associated in the UCS with the Hebrew script but are not included in Israeli standards.

5.4.2 Handling Unrenderable Characters

When a document is saved, the characters that are not supported shall be saved (including characters that are not rendered).

When a document is transmitted, the characters that are not supported shall be preserved if the external encoding allows it. When it does not, they should be encoded as numeric character references.

If the external encoding supports only some of the points and cantillation marks, those characters that are supported shall be preserved and the others shall be suppressed with no indication of error.

6. Directionality

The Hebrew and Arabic scripts are defined in HTML and in the UCS as having implicit directionality. Implicit directionality renders the text according to the directionality property of each character and the base directionality of the block.
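
The directionality property of individual characters can be inspected with Python's standard unicodedata module, as in the sketch below. The sample string and the function name bidi_classes are assumptions for illustration; the class abbreviations (L, R, AL, EN, WS) are those of the Unicode bidirectional algorithm. The sketch only looks up properties and does not perform the reordering itself.

```python
import unicodedata

def bidi_classes(text: str) -> list:
    """Return the bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin letter, Hebrew letter Alef, space, European digit.
sample = "a\u05D0 1"
for ch, cls in zip(sample, bidi_classes(sample)):
    print(f"U+{ord(ch):04X}", cls)
```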

The bidirectional algorithm is applied to the text that is formed as the result of the interpretation of the HTML markup. The markup itself, although it appears as text, does not participate in the bidirectional algorithm.

6.1 The Implementation of Directionality in HTML

HTML includes attributes and tags for the determination of directionality.

The default base directionality is left-to-right. The dir attribute allows specifying right-to-left directionality. The directionality operates in a hierarchical manner, according to the hierarchy of the various elements of the document.

6.1.1 The dir Attribute

The dir attribute is defined for most of the tags in HTML. Its values are RTL (right-to-left directionality) and LTR (left-to-right directionality).

The meaning of the dir attribute depends on the nature of the element, whether it is a block-level element or an inline element.

6.1.1.1 Segment Directionality

For block-level elements, the dir attribute specifies the base directionality of the element. In its absence, the element inherits the base directionality from the element that includes it, up to the level of the document (the HTML tag).

The nesting level for each such element shall be determined according to its base directionality. This means that if the text includes directional formatting codes that require a PDF to end them, and the PDFs are missing, the bidirectional algorithm will work as if they were present at the end of the element.

For inline elements the dir attribute specifies an embedding level and is equivalent to an RLE or LRE. The end tag is equivalent to a PDF.

Note: The dir attribute of the IMG tag applies to the alt text.

6.1.1.2 Pre-formatted segments

Elements such as PRE, XMP and LISTING specify pre-formatted text but allow the dir attribute. The content of these elements is pre-formatted for the division of the lines, but not for the bidirectional algorithm. The bidirectional algorithm shall be applied to each line separately, while on other elements it is applied to the whole element.

6.1.2 The Bidirectional Override Tag (BDO)

The BDO tag overrides, for the purpose of the bidirectional algorithm, the bidi properties of the characters it includes, and enforces the specified directionality. It is equivalent to an RLO or LRO. The end tag is equivalent to a PDF.
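
The equivalences stated in 6.1.1.1 and 6.1.2 can be expressed as a short Python sketch that wraps text in the corresponding directional formatting codes. The function name and its parameters are hypothetical; the sketch only illustrates the mapping and is not a suggestion to use these codes in HTML in place of the markup.

```python
LRE, RLE, PDF = "\u202A", "\u202B", "\u202C"  # embeddings and their terminator
LRO, RLO = "\u202D", "\u202E"                 # overrides

def as_formatting_codes(text: str, direction: str, override: bool = False) -> str:
    """Wrap text as an inline element with dir (or a BDO element) would."""
    direction = direction.lower()
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    if override:   # BDO element: equivalent to RLO or LRO
        opener = RLO if direction == "rtl" else LRO
    else:          # inline element with dir: equivalent to RLE or LRE
        opener = RLE if direction == "rtl" else LRE
    return opener + text + PDF  # the end tag is equivalent to a PDF

wrapped = as_formatting_codes("\u05E9\u05DC\u05D5\u05DD", "RTL")
print([f"U+{ord(ch):04X}" for ch in wrapped])
```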

6.1.3 Directional Formatting Codes

The UCS includes 7 directional formatting codes. [Note: See this link for the directional formatting codes].

The entities &rlm; and &lrm; represent the characters RLM and LRM.

It is recommended that the other directional formatting codes should not be used in HTML. The equivalent HTML markup should be used.
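
For reference, the sketch below lists the seven directional formatting codes as they are defined in Unicode and checks that the HTML entities &rlm; and &lrm; resolve to RLM and LRM. The list is taken from the Unicode character database, not from Table A-5, which is not reproduced in this translation; the dictionary name is an assumption.

```python
import html
import unicodedata

# The seven directional formatting codes, keyed by their usual abbreviations.
FORMATTING_CODES = {
    "LRM": "\u200E", "RLM": "\u200F",
    "LRE": "\u202A", "RLE": "\u202B", "PDF": "\u202C",
    "LRO": "\u202D", "RLO": "\u202E",
}

for abbr, ch in FORMATTING_CODES.items():
    print(abbr, f"U+{ord(ch):04X}", unicodedata.name(ch))

# The two codes that have HTML entities.
rlm = html.unescape("&rlm;")
lrm = html.unescape("&lrm;")
```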

6.1.4 Elements with no directionality

Some HTML elements have no directionality, although they may contain text that has. These elements will be laid out as if they were neutral characters, i.e. the element in its entirety shall be placed in the line as if it was a single neutral character. If the element has textual content, the bidirectional algorithm will process this content separately.

If a different layout of the elements is needed, LRM and RLM should be used.

Elements of this type are IMG, INPUT, SELECT, TEXTAREA, APPLET, EMBED and further elements that may be defined in the future.

The element BR indicates a new line, has no effect on the bidirectional algorithm and does not indicate a block separator.

7. The Language Attribute

The language attribute code for Hebrew is "he". The value "he-IL" and the obsolete values "iw" and "iw-IL" shall be supported as equivalent. The language code for Yiddish is "yi", and the obsolete value "ji" is equivalent.
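
A minimal Python sketch of the equivalences above follows. The function name and the mapping table are assumptions for illustration. As a simplification, the sketch discards any region subtag, which matches the treatment of "he-IL" and "iw-IL" here but may be too coarse for other languages.

```python
# Obsolete primary language codes and their current equivalents.
_OBSOLETE = {"iw": "he", "ji": "yi"}

def canonical_language(tag: str) -> str:
    """Map a language attribute value to its current primary code."""
    primary = tag.strip().lower().split("-")[0]  # drop the region subtag
    return _OBSOLETE.get(primary, primary)

for value in ("he", "he-IL", "iw", "iw-IL", "yi", "ji"):
    print(value, "->", canonical_language(value))
```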

Note: The language attribute has no effect on directionality.

8. Forms

Hebrew affects three aspects of forms: the encoding, the keyboard and the rendering.

8.1 Encoding the reply

It is not recommended to use the GET method for fields where the reply may include Hebrew. [Note: See the note on the form submission method in the HTML 4 specification.]

For the POST method, when the reply contains Hebrew and the ACCEPT-CHARSET attribute includes an encoding that supports Hebrew, this encoding should be selected.
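
One way a user agent might make this selection is sketched below in Python: it walks the ACCEPT-CHARSET list and takes the first encoding in which the reply can be represented. The function name is hypothetical, and charset names that Python's codec registry does not recognise are simply skipped, which is a limitation of the sketch rather than a rule of the standard.

```python
from typing import Optional

def choose_reply_charset(reply: str, accept_charset: str) -> Optional[str]:
    """Return the first listed charset that can encode the reply, if any."""
    # The attribute value is a list separated by spaces and/or commas.
    for name in accept_charset.replace(",", " ").split():
        try:
            reply.encode(name)
        except (LookupError, UnicodeEncodeError):
            continue  # unknown to Python, or cannot represent the reply
        return name
    return None

print(choose_reply_charset("\u05E9\u05DC\u05D5\u05DD", "ISO-8859-1, UTF-8"))
```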

8.2 The Keyboard

When the user enters a right-to-left input field the keyboard language should be set to Hebrew and the cursor placed on the right hand side of the field. When the user enters a left-to-right input field the keyboard language shall be set to the foreign language and the cursor placed on the left hand side of the field.

If the user has overridden the base directionality of the input field, this will be indicated in the response by surrounding the reply with the characters LRM or RLM, if the encoding of the reply supports them.

8.3 Rendering the Form

The elements of the form, INPUT, SELECT and TEXTAREA, shall be placed as if they were neutral characters.

9. Formatting

9.1 Alignment

The default alignment depends on the base directionality of the text. For left-to-right elements, the default is align=left, while for right-to-left elements, the default is align=right.

When full justification is specified (align=justify), the last line shall be aligned according to the base directionality of the element.

9.2 Tables

The directionality of a table is either the inherited directionality or that specified by the dir attribute for the TABLE element. For a right-to-left table, the first column is on the right side. For a left-to-right table, the first column is on the left side.

9.3 Lists

The directionality of a list is either the inherited directionality or that specified by the dir attribute for the list element (OL, UL, DL). For a right-to-left list, the bullets, numbers or terms are on the right side. For a left-to-right list, the bullets, numbers or terms are on the left side.

Appendix E - Compatibility (Informational)

At the moment there is a large quantity of Hebrew information on the Internet using visual directionality, and not as specified in this standard.

For compatibility, the parameter charset=ISO-8859-8 defines encoding according to SI 1311 using visual directionality. [Note: See bidirectionality and character encoding in the HTML 4 specification.] This parameter should be added to existing documents when it is not possible to convert them to the standard.

Note: Support for explicit directionality, charset=ISO-8859-8-E, is not required.

Appendix F - Arabic (Informational)

It is recommended that software supporting this specification will also support the Arabic language.

---

© 2000 - 2001 Jonathan Rosenne. All rights reserved. Last modified January 5, 2001.

The latest version of this document resides at http://www.qsm.co.il/Hebrew/si4281e.htm

Please send your comments to Jonathan (Jony) Rosenne, rosennej@qsm.co.il
