Linguistic Issues In Encoding Sanskrit

Authors: Peter M. Scharf and Malcolm D. Hyman
Publisher: Motilal Banarsidass Publishers Pvt. Ltd.
Edition: 2012
ISBN: 9788120835399
Pages: 290 (8 B/W Illustrations)
Cover: Hardcover
Other Details: 9.0 inch X 5.5 inch
Weight: 540 gm

About the Author

PETER SCHARF is an expert in Indian linguistic traditions. After earning his doctorate in Sanskrit at the University of Pennsylvania and studying vyākaraṇa in Varanasi, he taught Sanskrit at Brown University. He is the founder and director of the Sanskrit Library, directed an NSF-funded project to integrate Sanskrit texts, lexical resources, and linguistic software, and currently directs two NEH-funded projects to enhance access to Sanskrit manuscripts and to synthesize and revise Sanskrit lexical sources. In 2012-13, he holds a Blaise Pascal International Research Chair in the Laboratoire d’Histoire des Théories Linguistiques at the Université Paris Diderot.

MALCOLM HYMAN was an expert in classics and digital humanities. After earning his doctorate in classical philology at Brown University, he served as research fellow in digital projects in the history of science at Harvard University and the Max Planck Institute for the History of Science. He co-founded the Sanskrit Library, collaborated with Scharf in the NSF project to integrate Sanskrit texts, lexical resources, and linguistic software, and served as the Library’s technical director until his untimely death in 2009.

About the Book

In Linguistic Issues in Encoding Sanskrit, Scharf and Hyman examine fundamental issues in the coding of natural language texts. The over-arching issue concerns the relation information selected for encoding bears to natural language structure. Should written characters or speech sounds be encoded? Should segments or features be encoded? What criteria should be used to contrast items selected for encoding? The book stems from the recognition that current uses of information technology demand higher standards of encoding than the inherited systems in current use. Guided by visual factors, current encoding systems reproduce deficiencies inherent in traditional writing systems. Scharf and Hyman consider more relevant information-processing principles suitable for the contemporary use of computers for the manipulation of linguistic and textual data. The book focuses on Sanskrit, which is characterized by an extensive oral tradition, a highly phonetic orthography, and a copious literature.

Foreword By George Cardona

Questions surrounding the encoding of speech have been considered since scholars began to consider the history of different writing systems and of writing itself. In modern times, attention has been paid to such issues as standardizing systems for portraying in Roman script the scripts used for recording other languages, and this has given rise to discussions about distinctions such as that between transliteration and transcription. In recent times, moreover, the advent and general use of digital technology has allowed us not only to replicate with relative ease details of various scripts and to produce machine searchable texts but also to reproduce images of manuscripts that can be viewed and manipulated, a true boon to philologists in that they are thus enabled to consult and study materials with all the details found in original manuscripts, such as different hands that can be discerned and clues to modifications made due to features of different scripts. At the source of such endeavors lie the facts of language: phonological and phonetic matters that scripts portray with various degrees of fidelity.

India can justifiably lay claim to being the home of what is doubtless the most thorough and sophisticated consideration of speech production, phonetics, and phonology in ancient times. The preservation of Vedic texts and their proper recitation according to the norms of various groups of reciters led to the early analysis of continuously recited texts (saṁhitāpāṭha) into constituents, called pada, characterized by alternations that appear at word boundaries, including boundaries before particular morphemes within syntactic words. A text that includes such elements is termed padapāṭha. At least one such analyzed text predates the grammarian Pāṇini, the padapāṭha to the Rig-Veda by Śākalya. The padapāṭha related to any saṁhitāpāṭha obviously derives from the latter, its source. On the other hand, the separate padas of the padapāṭha can be viewed theoretically as the source of the continuously recited text, gotten by removing pauses at boundaries and thereby applying phonological rules that take effect between contiguous units. This is, in fact, the theoretical stance taken by authors of texts called prātiśākhya, which formulate phonological rules modifying padas in contiguity with other padas. Thus, phonological alternations within Vedic texts were objects of concern by at least the early sixth century B.C. Pāṇini himself, who can hardly be dated later than around 500 B.C., composed a generalized grammatical work, his Śabdānuśāsana, which includes both a set of rules, called Aṣṭādhyāyī, serving to account through a derivational system for the accepted usage of his time and place as well as certain dialectal differences and features particular to earlier Vedic.
One of the appendices to the Aṣṭādhyāyī is an inventory of sounds (referred to as the akṣarasamāmnāya by early students of Pāṇini’s work) that is divided into fourteen sets, each set off from the others by a final consonantal marker (it), which serves to form abbreviatory terms (pratyāhāra) referring to groups of sounds with respect to phonological rules as formulated in the Aṣṭādhyāyī.
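
The abbreviation mechanism described above can be sketched in a few lines of Python. This sketch is only an illustration and is not taken from the book: the data layout, the name SIVA_SUTRAS, and the function pratyahara are assumptions made for this example, the sounds are given in Roman transliteration, and each short vowel stands in for its whole class. A term is formed by pairing a sound with a later marker and collecting every sound in between, leaving out the markers themselves.

```python
# The fourteen sound sets in Roman transliteration, each stored as
# (sounds, final marker). This layout is an assumption of the sketch.
SIVA_SUTRAS = [
    (["a", "i", "u"], "ṇ"),
    (["ṛ", "ḷ"], "k"),
    (["e", "o"], "ṅ"),
    (["ai", "au"], "c"),
    (["h", "y", "v", "r"], "ṭ"),
    (["l"], "ṇ"),
    (["ñ", "m", "ṅ", "ṇ", "n"], "m"),
    (["jh", "bh"], "ñ"),
    (["gh", "ḍh", "dh"], "ṣ"),
    (["j", "b", "g", "ḍ", "d"], "ś"),
    (["kh", "ph", "ch", "ṭh", "th", "c", "ṭ", "t"], "v"),
    (["k", "p"], "y"),
    (["ś", "ṣ", "s"], "r"),
    (["h"], "l"),
]


def pratyahara(first_sound, marker):
    """Return the sounds named by the term first_sound + marker.

    Collection starts at the first occurrence of first_sound and stops at
    the first set, at or after that point, that ends in the given marker.
    Markers are never included in the result.
    """
    collected = []
    started = False
    for sounds, set_marker in SIVA_SUTRAS:
        for sound in sounds:
            if not started and sound == first_sound:
                started = True
            if started:
                collected.append(sound)
        if started and set_marker == marker:
            return collected
    raise ValueError("no such term: " + first_sound + marker)


if __name__ == "__main__":
    print(pratyahara("i", "k"))  # the term "ik"
    print(pratyahara("a", "c"))  # the term "ac" (the vowels)
```

Under these assumptions, pratyahara("i", "k") collects i, u, ṛ, ḷ, and pratyahara("a", "c") collects the nine vowel entries. Because the marker ṇ closes two sets, the sketch simply stops at its first occurrence; the grammatical tradition settles that ambiguity by interpretation, which this example does not model.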

The order of sounds in Pāṇini’s akṣarasamāmnāya shows properties best explained as due to its being a reworking of an earlier source. The five sets of stops in such earlier inventories, moreover, show an obvious phonetic ordering, from velar to labial, that is, an order based on the production of sounds, from the back of the oral cavity to the front. Moreover, prātiśākhyas not only state rules of phonological replacement but also describe the production of sounds, a topic which is dealt with in works on phonetics (śikṣā) such as the Āpiśaliśikṣā of Āpiśali. Accordingly, scholars are justified in maintaining that early Indian texts reflect a sophisticated investigation of Sanskrit phonology and phonetics. Scholars have also frequently debated whether or not writing played a role in the composition and transmission of such early works as the prātiśākhyas and the Aṣṭādhyāyī. There can be no doubt whatever that the latter was later transmitted orally. It is also most plausible that Pāṇini himself composed and transmitted his work orally. Thus, Pāṇini formulates a group of rules identifying certain sounds as markers, given the class name it, and provides that such sounds are unconditionally deleted before any other operations apply. Had he transmitted his work in writing, thus being able to make use of script particularities such as placing given sounds above or below a line, Pāṇini would not have needed such rules. That works such as Pāṇini’s were transmitted orally does not mean, however, that the society in which Pāṇini lived was not literate. To the contrary, he lived in a part of the subcontinent (Śalātura, in the extreme north-west) that at his time was under Persian control, and the Achaemenid rulers had inscriptions recorded. Nevertheless, a literate society does not imply necessarily that compositions must be put in writing and thus transmitted; later Indian traditions, for example, stress the oral transmission, though writing was clearly known then.
The earliest attested written documents on the subcontinent, nevertheless, come several centuries after Pāṇini. These are the inscriptions of the emperor Aśoka in the third century B.C., which for the most part employ two scripts: Brāhmī everywhere except the north-west, where Kharoṣṭhī is used; in the extreme north-west, one finds also Aramaic and Greek used.

Peter M. Scharf and the late Malcolm D. Hyman have written a valuable work, Linguistic Issues in Encoding Sanskrit, in which Sanskrit and its systems of description and transmission serve as a background to more general discussions concerning encoding of language. The authors explain the need for a work such as this and set forth their general aims in the introduction (p. 2) as follows:

Today people use computers to manipulate linguistic and textual data in sophisticated ways; yet current encoding systems tend to reflect visual and orthographic design factors to the exclusion of more relevant information-processing principles. Thus these systems reproduce deficiencies inherent in the traditional orthographies themselves. In this book we examine some fundamental issues in the coding of natural language texts. We consider above all the relation the information selected for encoding bears to natural language structure. We focus on Sanskrit, which is characterized by an extensive oral tradition, a highly phonetic orthography, and a copious literature. We survey various Sanskrit encoding schemes in past and present use and investigate their suitability for particular applications. We conclude by advancing some concrete proposals.

Although this book centers on Sanskrit, it covers a great many important issues and history relative to the general subject of encoding. The second and third chapters take up different coding systems. A brief sketch of the history of Indian printing serves as a background to presenting coding systems, including Roman transliterations, keyboard arrangements, and Unicode. These are subjected to a critique that centers on issues of ambiguity and redundancy consequent to their being based on Devanāgarī and Roman transliteration. The fourth chapter may well be the most important one from a theoretical viewpoint. Here the authors take up what they deem to be the basis for encoding. Their discussion is organized around three axes, as follows (p. 47): Axis I: Graphic-phonetic: Is the basic unit of the encoding a written character or a speech sound? Axis II: Synthetic-analytic: Are units encoded as a single Gestalt? Or are they decomposed into distinctively encoded features? Axis III: Contrastive-non-contrastive: Are code points selected only for units that contrast minimally (graphemes or phonemes)? The sixth and seventh chapters deal with the basic issue of encoding elements of speech or writing. The discussion of distinctive elements in chapter six is particularly wide ranging and includes succinct presentation of issues in areas such as generative grammar and historical linguistics. Given that the principal emphasis throughout is on Sanskrit, it is appropriate that these discussions are preceded, in chapter five, by considerations of Sanskrit phonetics and phonology. These include both presentations of what was said in various prātiśākhyas and śikṣās, including treatments of these statements by modern scholars, and feature analysis (section 5.2.6).
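
One familiar kind of redundancy in Unicode-based Roman transliteration can be shown with a short Python sketch. It is offered only as an illustration and is not drawn from the book's own critique: the variable names and the helper same_text are assumptions made for this example. The long vowel ā can be stored either as one precomposed code point or as the letter a followed by a combining macron, so two strings that look identical may differ code point by code point until they are normalized.

```python
import unicodedata

# Two encodings of the same transliterated vowel "ā".
precomposed = "\u0101"   # LATIN SMALL LETTER A WITH MACRON (one code point)
decomposed = "a\u0304"   # "a" followed by COMBINING MACRON (two code points)


def same_text(x, y):
    """Compare two strings after bringing both to normalization form NFC."""
    return unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y)


if __name__ == "__main__":
    # A plain comparison treats the two encodings as different strings.
    print(precomposed == decomposed)
    # After normalization they compare as the same text.
    print(same_text(precomposed, decomposed))
```

A search or collation routine that skips the normalization step would treat the two spellings as different words, which is the practical cost of encoding the same unit in more than one way.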

In the eighth and final chapter, the authors emphasize that, since computers now are used to carry out many tasks in addition to displaying data, this can no longer be considered the primary factor in determining a scheme for encoding. Instead, “... language should be encoded in such a way as to facilitate automatic processing, to minimize extrinsic ambiguity and redundancy, and to ensure longevity” (p. 113). Scharf and Hyman then go on to discuss what they call dynamic transcoding as well as possibilities concerning text-to-speech and speech-recognition and higher-level encoding.

The main text is complemented by a series of appendixes, four of which directly concern encoding. The first of these contains thirteen tables, in which are treated not only Sanskrit phonetic and phonological features but also, interestingly, reconstructions of Proto-Indo-European phonology according to different scholars. The second, third and fourth appendixes concern encoding schemes developed within the context of the Sanskrit Library established as a website by Scharf: the Sanskrit Library Phonetic basic encoding scheme, the Sanskrit Library segmental encoding scheme, and the Sanskrit Library phonetic featural encoding scheme.

Even this brief overview should show that Linguistic Issues in Encoding Sanskrit is a rich and varied work that deserves the serious attention not merely of Sanskritists but of scholars working in several areas related to language encoding.


Preface

The current generation is witnessing a transition in the dominant medium of knowledge transmission from print to electronics. The transition began in America and Western Europe but is quickly spreading around the world. Naturally, due to the region of its origin, conventions in the new digital medium have been dominated by the conventions of modern Western European languages. While these conventions are making some adjustments to suit the diversity of the world’s cultures, the world is likewise quickly adapting to prevalent standards, and these standards are quickly becoming entrenched. That which doesn’t fit the standards is in danger of being left behind. History has shown that in previous media transitions the knowledge that fails to adapt to the new medium recedes from public view to the restricted domain of the endeavoring antiquarian research scholar or becomes irretrievably lost. Yet the digital medium is flexible and powerful; it has the potential not only to adequately mimic the printed medium but to exceed it by innovative software design and interactivity. The current book, and indeed much of the work of the authors including the Sanskrit Library itself, is motivated by the desire to minimize the loss of access to the knowledge of the vast heritage of ancient India in the current media transition, to facilitate innovation in the digital medium to make that knowledge more readily accessible, and to inspire those who discover it to integrate that knowledge into the dominant stream of education and culture. We believe that the insights we have gained working to make Sanskrit more accessible should be of use in making other major culture-bearing languages of the world more accessible as well. Some of these insights should be useful in the communication of knowledge in the digital medium in general.

Sanskrit text has been moving into the digital medium. Recent decades have witnessed the growth of machine-readable Sanskrit texts in archives such as the Thesaurus Indogermanischer Text- und Sprachmaterialien (TITUS), Kyoto University, Indology, and the Göttingen Register of Electronic Texts in Indian Languages (GRETIL). The last few years have witnessed a burgeoning of digital images of Sanskrit manuscripts and books hosted on-line. For example, the University of Pennsylvania Library, which houses the largest collection of Sanskrit manuscripts in the Western Hemisphere, has made digital images of two hundred ninety-seven of them available on the web. The Universal Digital Library and Google Books have made digital images of large numbers of Sanskrit texts accessible as part of their enormous library digitization projects. Digitized Sanskrit documents include machine-readable text and images of lexical resources such as those of the Cologne Digital Sanskrit Lexicon project (CDSL) and the University of Chicago’s Digital Dictionaries of South Asia project (DDSA).

As oral, manuscript, and print media that have conveyed the knowledge embodied in the ancient Sanskrit language make their transition into digital media, a number of scholars have begun collaborating in the Sanskrit Computational Linguistics Consortium, which has organized several symposia since 2007. Members include linguists finding new challenges in formalizing the syntax of a free-word-order language, computer scientists drawn to model techniques of generative grammar used by the ancient Indian grammarian Pāṇini, philologists using digital methods to assist in critical editing, and scholars collaborating to build corpora, databases, and tools for the use of academic researchers and commercial enterprises. The authors of the present volume have actively participated in and fostered this growing collaboration.

Since 1999, we have worked together to facilitate the entry, linguistic processing, and display of Sanskrit texts both in print and on the Web. Our collaboration began with the preparation of the web and print publication of Scharf’s (2002) Rāmopākhyāna and the launch of The Sanskrit Library website in 2002, and continued with the International Digital Sanskrit Library Integration project at Brown University under grants from the National Science Foundation (NSF) 2006-2009. In July 2009 we began the project Enhancing Access to Primary Cultural Heritage Materials of India under a grant from the National Endowment for the Humanities, and in July 2010 we began the project Sanskrit Lexical Sources: Digital Synthesis and Revision. Struggling to overcome the lack of adequate encoding for Sanskrit led us to tackle the issue both practically and theoretically. With colleagues worldwide, we prepared a proposal to extend the Unicode Standard to allow adequate encoding of Vedic Sanskrit. Simultaneously, we engaged in a thorough review of the fundamental principles of encoding. We reviewed encoding principles not just for Sanskrit and not just in digital character encoding, but considered the question broadly in terms of the means by which humans communicate knowledge through speech, writing, print, and electronic media. The present volume is a result of these investigations. While the linguistic material discussed is drawn primarily from Sanskrit, the questions addressed are relevant to linguistic encoding in general and should be of interest to scholars of linguistics.

On the fifth of September 2009, I received a call from my colleague and co-author Malcolm Hyman’s wife informing me that he had passed away suddenly the night before. It is regrettable that he did not get to see the publication of this book that has been nearly complete for two years and that he himself was primarily responsible for typesetting. It is far more regrettable that the fruitful collaboration that we have undertaken in the past decade has come to an end, and that the potential contributions he had to make will not materialize. Malcolm had a comprehensive view of digital humanities and prescient vision of productive directions for research. I am grateful for what I have learned from him in the course of our work together, even in being forced to learn TeX to bring this book to completion. In tribute to him and in the hope that others may find his work instructive and inspiring, his complete curriculum vitae is included in Appendix E of this volume.

Part of this work was supported by the NSF under grant no. 0535207. Any opinions, findings, and conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of the NSF.


Contents

Foreword by George Cardona
1 Introduction
1.1 Technologies for representing spoken language
1.2 The Sanskrit language
1.3 The Devanāgarī script
1.4 Roman transliteration
1.5 The All-India Alphabet
2 Existing encoding systems for Sanskrit
2.1 A brief history of Indian printing
2.2 Legacy systems: before standards
2.4 ISCII
2.5 Unicode: Indic scripts
2.6 CS (Classical Sanskrit) and CSX (Classical Sanskrit Extended)
2.7 TITUS Indological 8-bit Encoding
2.8 Unicode: Indic transliteration
2.9 7-bit meta-transliterations
2.10 Velthuis transliteration and ITRANS
2.11 WX
2.12 Kyoto-Harvard
2.13 Varṇamālā
3 Critique of encoding systems seen so far
3.1 Ambiguity and redundancy
3.2 Ambiguity in the encoding of accentuation
4 The basis for encoding: a reanalysis
4.1 Axis I: Spoken communication is prior to written
4.2 Axis II: General remarks on the units of spoken and written language
4.2.1 Segments
4.2.2 Features
4.3 Axis III: What is relevant for encoding?
4.4 Encoding Sanskrit language vs. Devanāgarī script
5 Sanskrit phonology
5.1 Description of Sanskrit sounds
5.2 Phonetic and phonological differences
5.2.1 Phonetic differences
5.2.2 Sounds of problematic characterization
5.2.3 Differences in phonological classification of segments
5.2.4 Differences in the system of feature classification
5.2.5 Indian treatises on phonological features
5.2.6 Modern feature analysis
6 Sound-based encoding
6.1 Criteria for selecting distinctive elements to encode
6.1.1 Phoneme
6.1.2 Generative grammar
6.1.3 Historical linguistics
6.1.4 Paralinguistic semantics
6.1.5 Contrastive segments
6.1.6 Phoneme in the broader sense
6.1.7 Contrastive phonologies
6.2 Higher-order protocols
6.2.1 The phonetic encoding schemes
7 Script-based encoding
7.1 Featural analysis
7.2 Analysis of Devanāgarī script
7.3 Component analyses of Devanāgarī script
8 Conclusions
8.1 Dynamic transcoding
8.2 Text-to-Speech and speech-recognition
8.3 Higher-level encoding
A Tables
A.1 Phonetic features
A.2 Sounds categorized by Āpiśali
A.3 Sounds categorized by Śaunaka
A.4 Sounds categorized after Halle et al.
A.5 Sanskrit phonetics
A.6 Sanskrit phonetics according to Āpiśali
A.7 Sanskrit phonetics according to Śaunaka
A.8 Sanskrit phonemics
A.9 Sanskrit sounds derived from PIE by Burrow
A.10 PIE phonemics according to Burrow
A.11 PIE phonemics according to Szemerényi
A.12 Feature tree after Halle
A.13 Graphic features of Devanāgarī according to Ivanov and Toporov
B Sanskrit Library Phonetic Basic
B.1 Basic Segments
B.2 Punctuation
B.3 Modifiers
B.3.1 Stricture
B.3.2 Length
B.3.3 Accent
B.3.4 Nasalization
B.4 Modifier combinations and usage notes
B.4.1 Stricture
B.4.2 Length
B.4.3 Surface accent
B.4.4 Syllabified visarga and anusvāra accent
B.4.5 Nasals
C Sanskrit Library Phonetic Segmental
D Sanskrit Library Phonetic Featural
E Malcolm D. Hyman
E.1 A Memoir by Phoebe Pettingell
E.2 Curriculum Vitae
