Civilisations write for a variety of reasons, but if examples such as Mesopotamia are a guide, writing generally began with accounting and record keeping. One apparent exception is the writing of the Harappan civilisation, whose use ceased around 1900 BC.

The Harappan civilisation (2600-1900 BC) flourished in the western part of the Indian subcontinent along the river Indus and probably another river now called the Ghaggar-Hakra. These rivers flow along the length of present-day Pakistan and through northern and western parts of India. The civilisation was spread over a million square kilometres, with several large urban centres housing a few tens of thousands of people each. This pre-Iron Age civilisation had mastered large-scale engineering and water management better than some cities of the subcontinent have today.

Yet the Harappans' real love was for miniatures, as exemplified by their seals. Several thousand miniatures on all kinds of media have since been found in excavations. These media include terracotta, ivory, copper, ceramics and stone, with seals ranging in size from 1 cm to 5 cm. The Harappans drew animals, abstract designs, human forms, story scenes and other real and imaginary objects. Most conspicuous, however, is the string of writing found on about 70% of all inscribed objects. This writing, which covers either the whole seal or just its top, is often accompanied by drawings of animals and a decorated object.

Past work

Since the first seals were discovered more than 130 years ago, there has been much speculation about the meaning of Harappan writing, but none of the proposed interpretations has found general acceptance. Some experts have even questioned whether the writing is linguistic at all.

In the early seventies, Iravatham Mahadevan, a civil servant, developed a passion for the mystery of Indus writing. Working on punch-card computers at the Tata Institute of Fundamental Research (TIFR), he created the only published corpus and concordance of Harappan writing, listing about 3700 seals with writing. He showed that Indus writing has about 417 distinct signs, which occur in specific patterns. The average text is five signs long, and the longest text in a single line has 14 signs. He also established that the writing runs from right to left. Asko Parpola subsequently published two volumes of photographs of all known Harappan inscriptions.

After a flurry of activity between the 1960s and the 1980s, progress has been slow in recent years, apart from one thesis by Bryan Wells at Harvard University in the USA. Thanks to advances in computer science since then, we now have a better understanding of the structure of formal and spoken languages, and more powerful computational tools for analysing unknown writing.

Fresh probe

A group of scientists decided to revisit the problem using computational and statistical techniques to look for patterns in Indus writing [1,2,3]. The principal tools were pattern searches.

Part of the team that says the Indus script is linguistic. From left to right: Hrishikesh Joglekar, Nisha Yadav, Rajesh Rao and Mayank Vahia.

Earlier work had already shown that there are specific patterns of beginner, ender and middle signs, all indicating systematic writing. In addition, signs showed affinities both for particular neighbouring signs and for particular positions within a text string. Specific patterns of sign pairings and positional preferences that go far beyond random writing were established. It was also possible to subdivide long texts into shorter segments of two, three or four signs, indicating systematic writing with a complex internal structure.
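The idea of beginner, ender and middle signs can be illustrated with a simple count. The sketch below is only an illustration: the sign labels and the four-text corpus are invented, not taken from Mahadevan's concordance, and the function name `positional_counts` is hypothetical. Each text is assumed to be listed in reading order.

```python
from collections import Counter

# Hypothetical toy corpus: each text is a list of made-up sign labels
# given in reading order. None of this is real Indus data.
texts = [
    ["A", "B", "C"],
    ["A", "D", "C"],
    ["A", "B", "D", "E"],
    ["F", "B", "C"],
]

def positional_counts(corpus):
    """Count how often each sign begins, ends, or sits inside a text."""
    begin, middle, end = Counter(), Counter(), Counter()
    for text in corpus:
        if not text:
            continue  # skip empty texts
        begin[text[0]] += 1
        end[text[-1]] += 1
        middle.update(text[1:-1])  # everything between first and last sign
    return begin, middle, end

begin, middle, end = positional_counts(texts)
print("beginners:", begin.most_common())
print("middle signs:", middle.most_common())
print("enders:", end.most_common())
```

In this toy corpus, "A" turns up only at the start of texts and "C" only at the end. A strong positional preference of that kind, measured across thousands of real texts, is the sort of pattern the earlier work reported.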

More recent work used first-order Markov models to search for patterns in Indus writing, using the corpus created by Mahadevan. With these techniques, the researchers mined the data extensively, finding the nearest neighbouring signs for any given sign and building a matrix that captures which signs typically precede or follow it.
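A first-order Markov model of this kind boils down to counting adjacent sign pairs. The following is a minimal sketch under stated assumptions: the corpus is an invented toy, the helper names `successor_counts` and `transition_probs` are hypothetical, and the researchers' actual software is not described in this article.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of made-up sign labels in reading order.
texts = [
    ["A", "B", "C"],
    ["A", "D", "C"],
    ["A", "B", "D", "E"],
    ["F", "B", "C"],
]

def successor_counts(corpus):
    """Map each sign to a Counter of the signs that immediately follow it."""
    follows = defaultdict(Counter)
    for text in corpus:
        for current, nxt in zip(text, text[1:]):
            follows[current][nxt] += 1
    return follows

def transition_probs(follows):
    """Normalise the counts into first-order transition probabilities."""
    probs = {}
    for sign, counter in follows.items():
        total = sum(counter.values())
        probs[sign] = {nxt: n / total for nxt, n in counter.items()}
    return probs

follows = successor_counts(texts)
probs = transition_probs(follows)
for sign in sorted(probs):
    print(sign, "->", probs[sign])
```

Each row of the resulting table answers the question "given this sign, what tends to come next?". Reversing every text before counting would give the matching table of predecessors.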

The basic idea behind the fresh probe was that any writing system has to be rigid enough to convey unambiguous meaning and yet flexible enough to express a large variety of information. In English, for example, the word 'the' can be followed by a large number of other words (common nouns such as 'man' or 'woman'), but verbs such as 'ate' or 'walked' cannot immediately follow it. Even within single words, the letter 't' can be followed by many letters but almost never by 'x'. On the other hand, the letter 'q' is almost always followed by 'u'. This mix of flexibility and rigidity distinguishes linguistic writing from non-linguistic symbol systems.

The researchers examined writing systems with such sequential structure (including the Indus script) and quantified the amount of flexibility in the choice of the next symbol using a form of entropy called conditional entropy. This measures the level of disorder (or flexibility) that a writing system can tolerate.
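For readers who want the formula, conditional entropy is usually defined as below. This is the standard textbook definition rather than an expression quoted from the study, and the symbols are assumptions made for illustration: X stands for the current sign, Y for the sign that follows it, and P for probabilities estimated from counts of sign pairs.

```latex
H(Y \mid X) = -\sum_{x} P(x) \sum_{y} P(y \mid x) \, \log_2 P(y \mid x)
```

If every sign fully determines its successor, each P(y | x) is either 0 or 1 and the entropy is zero bits. If the next sign is drawn uniformly from N signs regardless of the current one, the entropy reaches its maximum of log2 N bits. Real writing is expected to fall somewhere between these two extremes.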

In order to understand the level of entropy in Indus writing and compare it with other systems, the researchers calculated the conditional entropy of Harappan writing alongside that of several natural languages: Rigvedic Sanskrit (the most ancient form of Sanskrit), Old Tamil (from the Dravidian language family, which forms the basis of a large number of south and central Indian languages and even a language spoken in Afghanistan), Sumerian and English. They also included non-linguistic information coding systems such as DNA and proteins, the artificial computer language Fortran, and two artificially created non-linguistic systems exhibiting highly disordered or extremely rigid sequential structure.

The results showed that the conditional entropy of the Harappan script falls right alongside that of linguistic writing systems. It follows a trend very unlike that of non-linguistic systems, even those that encode information, such as DNA and the computer language Fortran. To define the bounds of random versus rigid symbol order, the researchers used two controls representing the most flexible and the most rigid sequences of signs. Since no real corpus of either kind was available, they generated long artificial strings of data, which mark the limits within which all writing must fall. In another graph, they plotted the trends for other systems.
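The role of the two controls can be seen in a small experiment. The sketch below is not the researchers' code: the eight-sign alphabet, the sequence length, the in-between "mixed" sequence and the function name `conditional_entropy` are all assumptions chosen for illustration. It estimates the conditional entropy of the next sign given the current one, in bits, from counts of adjacent pairs.

```python
import math
import random
from collections import Counter

def conditional_entropy(seq):
    """Estimate H(next sign | current sign) in bits from adjacent pairs."""
    pair_counts = Counter(zip(seq, seq[1:]))   # (current, next) pairs
    current_counts = Counter(seq[:-1])         # signs that have a successor
    total_pairs = sum(pair_counts.values())
    h = 0.0
    for (current, _nxt), n in pair_counts.items():
        p_pair = n / total_pairs               # P(current, next)
        p_next = n / current_counts[current]   # P(next | current)
        h -= p_pair * math.log2(p_next)
    return h

N_SIGNS = 8             # hypothetical alphabet size, for illustration only
LENGTH = 20000          # long enough for stable estimates
rng = random.Random(0)  # fixed seed so the run is repeatable

# Most flexible control: every sign is drawn uniformly at random.
flexible = [rng.randrange(N_SIGNS) for _ in range(LENGTH)]

# Most rigid control: a fixed cycle, so each sign determines the next.
rigid = [i % N_SIGNS for i in range(LENGTH)]

# In-between toy: each sign allows exactly two possible successors.
mixed = [0]
for _ in range(LENGTH - 1):
    mixed.append((mixed[-1] + rng.choice([1, 2])) % N_SIGNS)

h_flexible = conditional_entropy(flexible)
h_mixed = conditional_entropy(mixed)
h_rigid = conditional_entropy(rigid)
print(f"flexible: {h_flexible:.3f} bits (upper bound is log2(8) = 3)")
print(f"mixed:    {h_mixed:.3f} bits")
print(f"rigid:    {h_rigid:.3f} bits")
```

On this toy data the rigid control comes out at zero bits, the flexible control close to the three-bit maximum, and the mixed sequence near one bit. Any sequence with partly constrained ordering has to land between the two controls, which is why they work as a check on the computation.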

The results indicated that Indus texts exhibit the same type of flexibility in symbol order as many natural languages. In earlier work, the team had shown that single signs, sign pairs, triplets and even groups of four signs have specific and significant preferences in pairing as well as in location within the written string. Given such regularities in the sequential structure of Indus texts and the conditional entropy results, the evidence for linguistic writing becomes very strong.

"The work marks an important milestone in our understanding of the Indus script and the intellectual growth of the Indus Civilisation. It has shown in a language independent manner that Indus writing may indeed be linguistic. The next step would be to understand the grammatical structure," says Mayank Vahia, principal investigator of the international collaborative programme.

Rajesh Rao, an associate professor of computer science and engineering at the University of Washington and lead author of the study, says the Indus script seems to have statistical regularities that are suggestive of natural languages.

So does that mean that we are close to deciphering the script? The researchers claim that they have reached a stage where they can write in Harappan script but cannot read it! "Reading it will take a lot more effort in identifying its distribution, fine structure and other aspects which are not apparent yet," Vahia says.

So the demonstration that Indus writing may indeed be linguistic only means that more intense work will have to be done to identify its fine structure before one can identify which language family it belongs to, and eventually, decipher its meaning.

Criticism

On the basis of an analysis of the frequencies of single signs, along with other arguments, a trio of American linguists had made the bold proposition that the Indus people were illiterate and that Indus writing was a pre-writing collection of symbols.

They reacted with fury to the latest suggestion that the script is linguistic, saying the work was faulty and the results false. Their main objections were: i) certain datasets used as controls were artificial, ii) non-linguistic systems can be found that produce conditional entropy graphs like those presented in the latest work, iii) the absence of writing material and long texts is 'proof' that the Indus people were illiterate.

Vahia and his group say the response is a result of 'elementary confusion between the nature of deduction and induction.' "The artificial data sets in our work represent controls, necessary in any scientific investigation, which delineate the limits of what is possible," he says. The conclusions do not depend on the controls, but are based on comparisons with real world data: DNA and protein sequences, various natural languages, and Fortran computer code. "All our real world examples are bounded by the maximum and the minimum provided by the controls, which thus serve as a check on the computation," he clarifies.

Vahia says their work does not claim that conditional entropy is a sufficient criterion for distinguishing language from non-language. "So, counter examples don't matter. Our methodology is quite different: we consider languages such as English and Sanskrit, and seek out what is common between them, thus defining the necessary conditions for language."

Vahia's group says several West Asian writing systems, Proto-Cuneiform, Proto-Sumerian, and the Uruk script, have statistical regularities in sign frequencies and text lengths which are remarkably similar to the Indus script. The lack of archaeological evidence for long texts in the Indus civilisation does not automatically imply that they did not exist.

"There is a long history of writing on perishable material like cotton and bark in the subcontinent using equally perishable writing implements. Writing on such material is unlikely to survive the hostile environment of the Indus valley. Thus, long texts might have been written, but no archaeological remains are to be found," he says.