Writing system


The Rosetta Stone is a granodiorite stele inscribed with a decree issued at Memphis in 196 BC on behalf of King Ptolemy V. The decree appears in three scripts: the upper text is Ancient Egyptian hieroglyphs, the middle portion Demotic script, and the lowest Ancient Greek. Because it presents essentially the same text in all three scripts, it provided the key to the modern understanding of Egyptian hieroglyphs.

A writing system is a communication system in which interlocutors send messages by applying symbols that represent units of spoken language to a permanent or semi-permanent medium (e.g. stone, paper, or sand) according to some visual or tactile code. Examples include braille, the English alphabet, and hieroglyphics_. In contrast, drawings, paintings, mathematical notation, and sheet music are not writing systems, because they do not represent a spoken language. Similarly, `Morse code`_, `sign language`_, and semaphore_ are not writing systems, because each uses a transient medium.


1   Function

Writing systems enabled people to record human history accurately, in a manner not prone to the kinds of error to which oral history is vulnerable. Soon after, writing provided a reliable form of long-distance communication. With the advent of publishing, it became the medium for an early form of mass communication.

2   Substance

Every writing system consists of three parts: a spoken language to be represented; a set of signs, or characters, such as letters and numerals (a "script"); and a code that maps between the two (an "orthography", which literally means "correct writing").

2.1   Graphemes

A character is the smallest possible component of a text. ‘A’, ‘B’, ‘C’, etc., are all different characters. So are ‘È’ and ‘Í’. Characters are abstractions, and vary depending on the language or context you’re talking about. For example, the symbol for ohms (Ω) is usually drawn much like the capital letter omega (Ω) in the Greek alphabet (they may even be the same in some fonts), but these are two different characters that have different meanings. [2]
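The ohm and omega example can be checked directly. The following is a minimal sketch using Python's standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

ohm = "\u2126"    # the symbol for ohms
omega = "\u03a9"  # the Greek capital letter omega

# Two distinct characters, even if a font draws them with the same glyph.
print(unicodedata.name(ohm))
print(unicodedata.name(omega))
print(ohm == omega)

# Unicode normalization maps the ohm sign onto omega, reflecting how
# closely related the two characters are despite their separate identities.
print(unicodedata.normalize("NFC", ohm) == omega)
```

Comparing the raw strings reports them as different, while comparing after normalization reports them as equal, which is one reason text is often normalized before comparison.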

What constitutes a grapheme is complex. The Arabic numerals, uppercase letters, lowercase letters, and punctuation marks (including the space) are all graphemes of English. Some graphemes consist of multiple characters ("multigraphs"). For example, in English, "ng" functions as a single unit in "king" or "singer" (a "digraph"). Similarly, in the C programming language, <: represents the same abstract unit as [ (i.e. the two are substitutable), as does :> for ]. [*]

Further, languages have a distinct grapheme for every irregular spelling. Because 60% of common English words have irregular spellings, [6] English has many graphemes. For example, the "sh" sound can be spelled in at least thirteen different ways: "ocean", "machine", "special", "pshaw", "sure", "schist", "conscience", "nauseous", "she", "tension", "issue", "mission", and "nation". [6] (Some people devise alternative orthographies, called "shorthands", that simplify these complexities.)

What constitutes a grapheme also depends on the particular orthography. Some languages (in particular, European languages) make use of dependent symbols such as accents or other diacritics. For example, Danish uses "å" and French uses "é" and "ô". Whether a base letter combined with a diacritic counts as a grapheme depends on the orthography. In Northern Sámi, for example, the combination "á" functions as a grapheme: it is enumerated separately in the alphabet and has its own place in the sort order. But in Danish, "á" is considered a variant of "a" and so is not a separate grapheme in that language. [4] By analogy, English treats "q" as an independent character even though it almost always appears before "u"; if English instead wrote "q" as a diacritic on "u", it would function like Danish.
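The two views of "á" (one unit, or a base letter plus a mark) both exist in Unicode. The sketch below uses the standard `unicodedata` module; `strip_marks` is a hypothetical helper written for this illustration, not a library function.

```python
import unicodedata

precomposed = "\u00e1"  # "á" stored as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)  # "a" + combining acute

print(len(precomposed))  # number of code points in the composed form
print(len(decomposed))   # number of code points in the decomposed form
print([unicodedata.name(c) for c in decomposed])


def strip_marks(text):
    """Drop combining marks, treating an accented letter as a variant of its base."""
    decomposed_text = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed_text if not unicodedata.combining(c))


print(strip_marks("\u00e1"))
```

Stripping marks matches an orthography that treats "á" as a variant of "a". It would be the wrong operation for an orthography in which the combination is a letter in its own right, so the choice has to be made per language.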


Different fonts: one character, different glyphs.

In some scripts, characters can have more than one shape due to certain behaviours of the script. This has nothing to do with changing fonts. For example, in Greek script, the sigma has two different shapes, according to its position within a word. [4]
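As a small illustration, Unicode happens to give the two lowercase sigma shapes separate code points, yet case mapping still treats them as the same letter. This sketch relies only on built-in `str` methods.

```python
final_sigma = "\u03c2"    # shape used at the end of a word
medial_sigma = "\u03c3"   # shape used elsewhere in a word
capital_sigma = "\u03a3"

# Different code points for the two lowercase shapes...
print(final_sigma == medial_sigma)

# ...but both correspond to the same capital letter,
print(final_sigma.upper() == capital_sigma)
print(medial_sigma.upper() == capital_sigma)

# and case folding, intended for caseless comparison, unifies them.
print(final_sigma.casefold() == medial_sigma.casefold())
```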

Graphemes are abstract. A glyph is a visual representation of a character; a mark made on screen or paper. A glyph is a specific shape that represents a grapheme in a specific typeface. A writer may write the same grapheme using different glyphs ("allographs"), but each is interpreted the same. Graphemes can be thought of as the denotation of the glyphs which represent them.

Graphemes are represented with angle brackets, e.g. <a> or <b>.

2.2   Script


Ancient Sumerian script. The glyphs become more abstract, probably as they move away from representing words to representing phonemes.

A script consists of a mix of letters, syllabic characters, and logograms.

Writing systems may include characters that map to different kinds of linguistic units. For example, in English, the letter "a" maps to a phoneme, while the numeral "1" maps to a morpheme. A special set of symbols known as punctuation aids the structure and organization of many writing systems and helps capture nuances of meaning that speech conveys through timing, tone, accent, inflection, or intonation.

A sequence of characters is called a string. In computing, characters also include control characters such as carriage return or tab, as well as instructions to printers or other devices that display or otherwise process text.

2.2.1   Alphabet

In languages that use alphabetic writing systems, the graphemes stand in principle for the phonemes (significant sounds) of the language. In practice, however, the orthographies of such languages entail at least a certain amount of deviation from the ideal of exact grapheme–phoneme correspondence. A phoneme may be represented by a multigraph (sequence of more than one grapheme), as the digraph sh represents a single sound in English. Some graphemes may not represent any sound at all (like the b in English debt), and often the rules of correspondence between graphemes and phonemes become complex or irregular, particularly as a result of historical sound changes that are not necessarily reflected in spelling.

While most alphabets have letters composed of lines (linear writing), there are also exceptions such as the alphabets used in Braille, fingerspelling, and Morse code.

Alphabets are usually associated with a standard ordering of their letters. This makes them useful for purposes of collation, specifically by allowing words to be sorted in alphabetical order. It also means that their letters can be used as an alternative method of "numbering" ordered items, in such contexts as numbered lists.

2.2.2   Logograms

A logogram, or logograph, is a grapheme which represents a word or a morpheme (the smallest meaningful unit of language). This stands in contrast to phonograms, which represent phonemes (speech sounds) or combinations of phonemes, and determinatives, which mark semantic categories.

Logograms are commonly known also as "ideograms". Strictly speaking, however, ideograms represent ideas directly rather than words and morphemes, and none of the logographic systems described here is truly ideographic.

Logographic systems, or logographies, include the earliest true writing systems; the first historical civilizations of the Near East, Africa, China, and Central America used some form of logographic writing.

A purely logographic script would be impractical for most languages, and no natural one is known to exist.

The main difference between logograms and other graphemes is that logograms aren't linked directly to their pronunciation. An advantage of this separation is that a reader doesn't need to understand the pronunciation or language of the writer to understand the text. The reader will recognise the meaning of 1, whether it is called one, ichi or wāḥid in the language of the writer. Likewise, people speaking different Chinese dialects may not understand each other in speaking, but may do so to a significant extent in writing even if they don't write in standard Chinese. Therefore, in China, Vietnam, Korea and Japan prior to modern times, communication by writing (筆談) was the norm of international trade and diplomacy.

This separation, however, also has the great disadvantage of requiring the memorization of the logograms when learning to read and write, separately from the pronunciation.

2.3   Orthography


The mapping from phonemes to English strings through IPA symbols. [5] Notice that different phonemes can be represented by the same string, that the same phoneme can be represented by different strings, and that some phonemes are represented by multiple characters.

An orthography is a mapping from units of a spoken language (e.g. phonemes, syllables, morphemes, or words, plus grammatical category) to ways of representing them in writing, i.e. to units of written language ("graphemes").

3   Properties

3.1   Completeness

Writing systems are conceptual systems, as are the languages to which they refer. Writing systems may be regarded as complete according to the extent to which they are able to represent all that may be expressed in the spoken language.

3.2   Stability

Writing systems generally change more slowly than their spoken counterparts. Thus they often preserve features and expressions which are no longer current in the spoken language.

3.3   Suggestivity

A notation is suggestive if the forms of the expression arising in one set of problems suggest related expressions which find application in other problems. [7]

Suggestiveness of a notation may make it seem harder to learn because of the many properties it suggests for exploration. For example, the notation +.× for matrix product cannot make the rules for its computation more difficult to learn, since it at least serves as a reminder that the process is an addition of products, but any discussion of the properties of a matrix product in terms of this notation cannot help but suggest a host of questions, such as: Is ∨.∧ associative? Over what does it distribute? [7]
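To make the suggestion concrete, here is a minimal Python sketch of a generalized inner product in which the "addition" and the "product" are parameters. The function name `inner_product` and the sample matrices are assumptions made for this illustration; they do not come from the cited paper.

```python
from functools import reduce
import operator


def inner_product(f, g, a, b):
    """Combine each row of a with each column of b using g, then reduce with f."""
    columns = list(zip(*b))
    return [
        [reduce(f, (g(x, y) for x, y in zip(row, col))) for col in columns]
        for row in a
    ]


# Ordinary matrix product: an addition of products.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
plus_times = inner_product(operator.add, operator.mul, A, B)
print(plus_times)

# The same pattern with "or" and "and" on Boolean matrices.
P = [[True, True], [False, True]]
Q = [[False, True], [True, True]]
R = [[True, True], [False, False]]
or_and = inner_product(operator.or_, operator.and_, P, Q)
print(or_and)

# One instance of the associativity question the notation raises.
left = inner_product(operator.or_, operator.and_, or_and, R)
right = inner_product(
    operator.or_, operator.and_, P,
    inner_product(operator.or_, operator.and_, Q, R),
)
print(left == right)
```

Checking a single instance does not prove associativity, but it shows how an executable notation turns the question into an experiment.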

4   Classification

A phonographic writing system uses symbols to represent components of auditory language, which in turn refer to things or ideas. By analogy, consider programming languages that rely on keywords, like JavaScript or Python.

An ideographic writing system refers to ideas independently of their pronunciation in a language. By analogy, consider programming languages that rely on symbols, like CoffeeScript.

The general attributes of writing systems can be placed into broad categories such as alphabets, syllabaries, or logographies. In the alphabetic category, a standard set of letters (basic written symbols or graphemes) for consonants and vowels encodes speech on the general principle that letters (or letter pairs or groups) represent phonemes (basic significant sounds) of the spoken language. A syllabary typically correlates a symbol to a syllable (which can be a pairing or group of phonemes and is considered a building block of words). In a logography, each character represents a word, morpheme, or semantic unit (which can itself be a pairing or group of syllables).

The principal types of graphemes are logograms, which represent words or morphemes (for example Chinese characters, the ampersand & representing the English word and, Arabic numerals); syllabic characters, representing syllables (as in Japanese kana); and alphabetic letters, corresponding roughly to phonemes (see the Alphabet section above).

5   History

The art of writing was invented in Egypt in about the year 4000 BC and in Babylonia not much later. In each country, writing began with pictures of the objects intended. These pictures quickly became conventionalized, so that words were represented by ideograms, as they still are in China. In the course of thousands of years, this cumbrous system developed into alphabetic writing. [9]

5.1   Greece

It was probably in the first half of the 8th century B.C. that the Greeks, through close contacts with the Phoenicians, borrowed the `Semitic alphabet`_ and transformed it into their own. Thus the first letter of the Semitic alphabet, alep (meaning "ox"), became alpha; bet ("house") became beta; gimel ("camel") became gamma; and dalet ("door") became delta. The first archaeological evidence of this early Greek script is on pottery vessels or sherds from the second half of the 8th century B.C.

This early Greek script is called the Euboean script.

One of the most important results, to the Greeks, of commerce or piracy--at first the two are scarcely distinct--was the acquisition of the art of writing. Although writing had existed for thousands of years in Egypt and Babylonia, and the Minoan Cretans had a script (which has not been deciphered), there is no conclusive evidence that the Greeks knew how to write until about the tenth century B.C. They learnt the art from the Phoenicians, who, like the other inhabitants of Syria, were exposed to both Egyptian and Babylonian influences, and who held the supremacy in maritime commerce until the rise of the Greek cities of Ionia, Italy, and Sicily. In the fourteenth century, writing to Ikhnaton (the heretic king of Egypt), Syrians still used the Babylonian cuneiform; but Hiram of Tyre (969-936) used the Phoenician alphabet, which probably developed out of the Egyptian script. The Egyptians used, at first, a pure picture writing; gradually the pictures, much conventionalized, came to represent syllables (the first syllables of the names of the things pictured), and at last single letters, on the principle of "A was an Archer who shot at a frog." This last step, which was not taken with any completeness by the Egyptians themselves, but by the Phoenicians, gave the alphabet with all its advantages. The Greeks, borrowing from the Phoenicians, altered the alphabet to suit their language, and made the important innovation of adding vowels instead of having only consonants. There can be no doubt that the acquisition of this convenient method of writing greatly hastened the rise of Greek civilization. [9]

People invented the first writing systems at the beginning of the `Bronze Age`_ in the late Neolithic Era of the late 4th millennium BCE. However, the development of writing systems, and the process by which they have supplanted traditional oral systems of communication, have been sporadic, uneven and slow.

Many ancient writing systems used pictograms purely for their sounds, regardless of their meaning ("the rebus principle"), to represent words that would otherwise be hard to represent with a pictogram. For example, one can represent the sentence "I can see you" by using the pictographs for "eye", "can", "sea", and "ewe".

It is an historical accident that the blank is used as a word divider; some ancient scripts used a bar or some other marker that looks more like a "real" character.

Inputting complex characters can be cumbersome on electronic devices due to a practical limitation in the number of input keys.

A grapheme is the smallest semantically distinguishing unit in a written language, analogous to the phonemes of spoken languages. A grapheme may or may not carry meaning by itself, and may or may not correspond to a single phoneme. Graphemes include alphabetic letters, typographic ligatures, Chinese characters, numerical digits, punctuation marks, and other individual symbols of any of the world's writing systems.

A grapheme is an abstract concept, similar to a character in computing. A glyph is a specific shape that represents that grapheme, in a specific typeface. For example, the abstract concept of "the Arabic numeral one" is a grapheme, which would have two different glyphs (allographs) in the fonts Times New Roman and Helvetica.

The importance of nomenclature, notation, and language as tools of thought has long been recognized. In chemistry and botany, for example, the establishment of systems of nomenclature by Lavoisier and Linnaeus did much to stimulate and to channel later investigation. [7]

5.2   Ink


According to ancient Chinese literature, Tien Tcheu invented ink between 2698 and 2587 BC. For centuries, the Chinese province of Kiang-si held a complete monopoly on ink manufacturing. The ink of that era became commonly known as India ink and was sold as solid cakes for use with the brush the Chinese employed in writing their characters.

Wood block printing was invented in China at the end of the sixth century, resulting in an enormous increase in ink consumption.

Pi Sheng invented movable type in the eleventh century. Movable type was not used in Europe until 1440, by Johannes Gutenberg of Mainz, Germany. William Caxton was the father of printing in England. In 1474, he built his first press at Westminster and three years later turned out the first English-language book, titled "The Recuyell of the Histories of Troy". Soon after, presses sprang up all over England.

A new era dawned on the printing trade with the discovery of photo-lithography by Fox Talbot in England in 1852. This permitted the use of drawings and photographs to enrich the page.

Inks for pen use constitute a small portion of the ink maker's total output. The bulk of production consists of typographic, lithographic, and other special inks. (One Montreal daily paper uses no less than 1452 pounds of ink every day. A weekly paper uses 5676 pounds per issue.)

6   Representation

6.1   Character set encoding

A character set encoding (or character encoding) is a system for representing characters in terms of binary numbers. [3] Examples of character set encodings include ASCII, UTF-8, and latin-1_.

Any character set encoding involves at least these two components: a set of characters and some system for representing these in terms of the processing units used within the computer. [3]
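The two components can be seen by encoding the same character under several encodings. This is a minimal sketch using the built-in `str.encode`; the helper `encode_all` and its default list of encodings are assumptions made for the illustration.

```python
def encode_all(text, encodings=("ascii", "latin-1", "utf-8", "utf-16-le")):
    """Return {encoding: hex bytes}, or None where the encoding lacks a character."""
    result = {}
    for name in encodings:
        try:
            result[name] = text.encode(name).hex()
        except UnicodeEncodeError:
            # The character is outside this encoding's character set.
            result[name] = None
    return result


print(encode_all("A"))
print(encode_all("\u00e9"))  # "é"
```

A character inside every set, such as "A", differs only in its byte representation, while "é" is missing from the ASCII character set altogether.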

In programming, a character is a code point for a particular character set. A single code point doesn't always encode a logical glyph. It may be a combining character, joiner, or control signal. [10]

Consider the length function of a string. Is it intended to return the number of glyphs, combined characters, or underlying code points? [10] The only safe option for length is to return the number of code points.
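Python's built-in `len` takes the code point view, which the sketch below contrasts with a rough count of visible units. `count_base_characters` is a hypothetical helper that merely skips combining marks; it is an approximation, not a full implementation of user-perceived character segmentation.

```python
import unicodedata

s = "e\u0301"  # "e" followed by a combining acute accent; usually drawn as one glyph

# len() counts code points.
print(len(s))


def count_base_characters(text):
    """Count code points that are not combining marks (a rough approximation)."""
    return sum(1 for c in text if not unicodedata.combining(c))


print(count_base_characters(s))
```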

Storing strings as an array of code points can be space inefficient. More compact variable-length encodings are frequently used, in which differing numbers of bytes represent a single code point. For example, in UTF-16_ a code point may comprise a 2-byte or 4-byte sequence. A surrogate is a value that doesn't encode a full character on its own. [10] Trying to work with an encoded string as a sequence of characters is very troublesome. What should the length function return: the number of encoding values or the number of code points? It's theoretically possible to write a String class that abstracts the underlying encoding of the string, but it would be inefficient. Basic operations like indexing a string or forward scanning require lots of overhead. In practice, it's simpler just to decode the string for processing. (The only drawback is that this uses more memory, but in practice it's never an issue.) [10]
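The difference between code points and encoding values shows up as soon as a string leaves the Basic Multilingual Plane. The following sketch uses only the standard library; the chosen code point is an arbitrary example.

```python
import struct

s = "\U0001F600"  # a single code point above U+FFFF
encoded = s.encode("utf-16-le")

# One code point...
print(len(s))

# ...stored as two 16-bit values, a surrogate pair.
units = struct.unpack("<%dH" % (len(encoded) // 2), encoded)
print([hex(u) for u in units])

# A code point inside the Basic Multilingual Plane needs only one 16-bit value.
print(len("a".encode("utf-16-le")) // 2)
```

Neither value of the pair means anything alone, which is why indexing into the encoded form by position can land in the middle of a character.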

If a programming language supports a string class, then it either needs multiple string classes for each encoding, or needs to pick a default encoding. But then the word "string" becomes overloaded. It can refer to the string class, or to an array of characters. [10]

7   Further reading

8   Footnotes

[*]Various reasons exist for using digraphs and trigraphs: keyboards may not have keys to cover the entire character set of the language, input of special characters may be difficult, text editors may reserve some characters for special use and so on. The basic character set of the C programming language is a subset of the ASCII character set that includes nine characters which lie outside the ISO 646 invariant character set. This can pose a problem for writing source code when the encoding (and possibly keyboard) being used does not support any of these nine characters. The ANSI C committee invented trigraphs as a way of entering source code using keyboards that support any version of the ISO 646 character set.

9   References

[2](1, 2) Unicode HOWTO. https://docs.python.org/2/howto/unicode.html
[3](1, 2) Character set encoding basics. http://scripts.sil.org/cms/scripts/page.php?item_id=IWS-Chapter03
[4](1, 2) Understanding characters, keystrokes, codepoints and glyphs. http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=IWS-Chapter02
[5]The sound of English and the International Phonetic Alphabet. http://www.antimoon.com/how/pronunc-soundsipa.htm
[6](1, 2) The not complete-idiot's guide to: Alternative Handwriting and Shorthand Systems for Dummies. http://www.alysion.org/handy/althandwriting.htm
[7](1, 2, 3, 4) Kenneth E. Iverson. 1979. Notation as a tool of thought.
[8]Nov 1948. B.C. Cred Unionist. Printing Ink - From 2698 B.C. http://enterprise-magazine.com/wp-content/uploads/2015/12/1948-November-magazine.pdf
[9]Bertrand Russell. 1945. The History of Western Philosophy. http://www.ntslibrary.com/PDF%20Books/History%20of%20Western%20Philosophy.pdf
[10](1, 2, 3, 4, 5) Edaqa Mortoray. 2013-08-13. We don't need a string type. https://mortoray.com/2013/08/13/we-dont-need-a-string-type/


Mathematical notation provides perhaps the best-known and best-developed example of language used consciously as a tool of thought.

By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and in effect increases the mental power of the race.

—A. N. Whitehead

The quantity of meaning compressed into small space by algebraic signs is another circumstance that facilitates the reasonings we are accustomed to carry on by their aid.

—Charles Babbage

Mathematical notation has serious deficiencies. It lacks universality, and must be interpreted differently according to the topic, according to the author, and even according to the immediate context.

Programming languages, because they were designed for the purpose of directing computers, offer important advantages as tools of thought. Not only are they universal (general-purpose), but they are also executable and unambiguous.

Executability makes it possible to use computers to perform extensive experiments on ideas expressed in a programming language. The lack of ambiguity makes possible precise thought experiments.

In other respects, however, most programming languages are decidedly inferior to mathematical notation and are little used as tools of thought in ways that would be considered significant by, say, an applied mathematician.

The thesis of the present paper is that the advantages of executability and universality found in programming languages can be effectively combined, in a single coherent language, with the advantages offered by mathematical notation.

A good notation should embody characteristics familiar to any user of mathematical notation.

l N is the list of integers up to N
/ is reduction: +/ is sum and x/ is product
\ is scan: +\ is sum scan (e.g. +\l5 == 1 3 6 10 15) and x\ is product scan
phi is reverse
5p6 == 6 6 6 6 6
N * M is power
, catenates its arguments
T produces a representation of its right argument in the radix specified by the left argument (e.g. 2 2 2 T 3 == 0 1 1 and 2 2 2 T 4 == 1 0 0)
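For readers more comfortable with a mainstream language, the primitives listed above can be approximated in Python. This is a rough sketch: the function names (`iota`, `scan`, `reshape`, `encode`) are labels chosen for this illustration, and the functions cover only the vector cases mentioned in the list.

```python
from functools import reduce
from itertools import accumulate
import operator


def iota(n):
    """The list of integers from 1 up to n."""
    return list(range(1, n + 1))


def scan(f, xs):
    """Running reduction: every prefix of xs reduced with f."""
    return list(accumulate(xs, f))


def reshape(n, x):
    """A list of n copies of the scalar x."""
    return [x] * n


def encode(radices, value):
    """Digits of value in the mixed radix given by radices."""
    digits = []
    for radix in reversed(radices):
        value, digit = divmod(value, radix)
        digits.append(digit)
    return digits[::-1]


print(reduce(operator.add, iota(5)))  # sum reduction
print(reduce(operator.mul, iota(5)))  # product reduction
print(scan(operator.add, iota(5)))    # sum scan
print(scan(operator.mul, iota(5)))    # product scan
print(iota(5)[::-1])                  # reverse
print(reshape(5, 6))
print(encode([2, 2, 2], 3), encode([2, 2, 2], 4))
```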

The term operator, used in the strict sense defined in mathematics, refers to an entity which applies to functions to produce functions. Examples include the derivative operator, reduction, and scan. A function produced by an operator (such as +/) will be called a derived function.

Indexing is denoted by an expression of the form X[I] where I is a single index or array of indices of the vector X. For example, if X <- 2, 3, 6, 7 then X[2] is 3 and X[2 1] is 3 2.

Drop is denoted by the down arrow, as in KvX, and is defined to drop K elements from X, from the head if K > 0 and from the tail if K < 0. The take function, denoted by the up arrow, is defined analogously: it keeps K elements of X, from the head if K > 0 and from the tail if K < 0.
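Indexing and drop can be sketched in Python as follows. The names `index`, `drop`, and `take` are chosen for this illustration. The sketch assumes that indices count from 1, as in the example above, that take keeps elements where drop removes them, and that K never exceeds the length of X.

```python
def index(x, i):
    """1-origin indexing by a single index or a list of indices."""
    if isinstance(i, int):
        return x[i - 1]
    return [x[k - 1] for k in i]


def drop(k, x):
    """Drop k elements from the head if k >= 0, else -k elements from the tail."""
    return x[k:] if k >= 0 else x[:k]


def take(k, x):
    """Keep k elements from the head if k >= 0, else -k elements from the tail."""
    return x[:k] if k >= 0 else x[k:]


X = [2, 3, 6, 7]
print(index(X, 2), index(X, [2, 1]))
print(drop(1, X), drop(-1, X))
print(take(1, X), take(-1, X))
```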

In order to use different representations conveniently, it is important to be able to express the transformations between representations clearly and precisely. Conventional mathematical notation is often deficient in this respect.

Although Cajori does not even mention the rules for the order of execution in his two-volume history of mathematical notations, it seems reasonable to assume that the motivation for the familiar hierarchy arose from a desire to make polynomials expressible without parentheses. The convenient use of vectors in expressing polynomials, as in +/CxX*E, does much to remove this motivation. Moreover, the rule adopted in APL also makes Horner's efficient expression for a polynomial expressible without parentheses:

+/3 4 2 5 x X * 0 1 2 3 <-> 3 + X x 4 + X x 2 + X x 5
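The equivalence can be checked numerically. In this sketch, `poly_sum` mirrors the left-hand side (a sum of coefficients times powers) and `horner` mirrors the right-hand side, read from right to left; both function names are chosen for the illustration.

```python
def poly_sum(coeffs, x):
    """Sum of coefficient * x**exponent, with exponents 0, 1, 2, ..."""
    return sum(c * x ** e for e, c in enumerate(coeffs))


def horner(coeffs, x):
    """Horner's scheme: evaluate from the highest coefficient inward."""
    result = 0
    for c in reversed(coeffs):
        result = c + x * result
    return result


coeffs = [3, 4, 2, 5]
for x in (0, 1, 2, -3):
    print(x, poly_sum(coeffs, x), horner(coeffs, x))
```

Horner's form uses one multiplication per coefficient and no explicit powers, which is the efficiency the text refers to.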

ELI5: Why is W called "double U" when it is clearly "double V"? https://www.reddit.com/r/explainlikeimfive/comments/5wvocj/eli5_why_is_w_called_double_u_when_it_is_clearly/

When Old English was written, it used a mixture of Latin letters and older runes. One of these runes was Wynn, which was used to represent the sound that w gives today: https://en.wikipedia.org/wiki/Wynn That rune was sometimes replaced by the combination uu - a double u - for the same sound. In German, the letter v changed in sound to be pronounced as f in most cases (it still is). In a few cases the v-sound was retained. To distinguish these cases, scribes began to write vv for these. When printing was developed in what is today Germany (and to some extent Italy, but that is less relevant here), the `printing press`_ manufacturers made types for the letters that they had. Since the combination vv was very common, they made a letter for it - w. In most languages the letter is called "double-v". These printing presses and the letters for them were exported everywhere, including to England. The English quickly realized that they didn't have types for all their letters, so they made do with what they had. Since English didn't have the w before printing, they simply reused that letter for the Wynn rune, which was missing. It is called "double-u" because it was also sometimes written as "uu".

Graphite has always been used at the core of pencils. Concerns about "pencil lead" were about lead being in the yellow paint on the pencils.