Panini's Karaka System for Language Processing is the outcome of research and development (R&D) carried out for the Doctor of Philosophy (Ph.D.) degree at the Special Centre for Sanskrit Studies, Jawaharlal Nehru University, New Delhi, India, under the supervision of Dr. Girish Nath Jha. The book can be broadly divided into five sections: Structure of Astadhyayi, Nominal Inflectional Morphology, Verbal Inflectional Morphology, Panini's Karaka System, and Language Processing.
Dr. Sudhir K Mishra
Birth: 7 April 1978
Place: Poraikalan, Khetasarai, Dist. Jaunpur (U.P.)
Academics: High School and Intermediate from Obra Intermediate College, Obra; graduation and post-graduation in Sanskrit from the University of Allahabad. Doctoral degree (Ph.D.) awarded by the Special Centre for Sanskrit Studies, Jawaharlal Nehru University, New Delhi.
Books: 1. संगणक- जनित व्यवहारिक संस्कृत-धातु रुपावली 2. अष्टाध्यायी- सूत्रपाठ 3. Artificial Intelligence and Natural Language Processing (Under Publication).
Research Papers: Four papers published in journals and 10 papers published in national and international proceedings.
Contact: Applied Artificial Intelligence Group, Centre for Development of Advanced Computing (C-DAC) Pune.
Preface
Panini's Karaka System for Language Processing is the outcome of research and development (R&D) at the Doctor of Philosophy (Ph.D.) level at the Special Centre for Sanskrit Studies, Jawaharlal Nehru University, New Delhi, India, from 2002 to 2007 under the supervision of Dr. Girish Nath Jha. The dissertation, titled "Sanskrit Karaka Analyzer for Machine Translation", was submitted in July 2007. The work was based on the formulation of the Paninian karaka theory and the affiliated vibhakti and karmapravacaniya theories. The system is available online on the website of the Special Centre for Sanskrit Studies, Jawaharlal Nehru University (http://sanskrit.jnu.ac.in/karaka/analyzer.jsp).
I was a research student in the first batch of the Special Centre for Sanskrit Studies and decided to work on a Karaka Analyzer for Machine Translation after one year of course work. Panini uses the term karake (Ast. 1.4.23) to refer to what brings a thing signified by a verb to accomplishment. According to Patanjali, the term karaka is a technical term (karotiti karakam) whose etymological meaning is retained (anvarthasamjna), so that it means 'that which brings about something'. The word karaka is derived from the root dukrn karane (do, make) with the krt suffix nvul (Ast. 3.1.133). Karaka is the relationship between the verb (tinanta) and the other constituents of the sentence (subanta). It was therefore a primary requirement to identify and analyze nominal and verbal inflectional morphology before starting the work on karaka. However, by the time the research topic had been approved by the competent bodies, more than two years had passed in course work and synopsis finalization. The work on verbal inflectional morphology was therefore started and completed for only 438 selected verb roots, and it was published in 2007 with my supervisor, Prof. Girish Nath Jha. The work on nominal inflectional morphology was also started but could not be completed because of time constraints.
This book can be broadly divided into five sections: Structure of Astadhyayi, Nominal Inflectional Morphology, Verbal Inflectional Morphology, Panini's Karaka System, and Language Processing. The chapter on the structure of the Astadhyayi covers the definition of sutra, the types of sutra, the arrangement of sutras in the Astadhyayi, and the technical terms of the Astadhyayi, along with descriptions of other texts associated with it, such as the siva sutra, dhatupatha, ganapatha, phitsutra, unadisutra, linganusasana, etc. The Nominal Inflectional Morphology chapter covers the types of nominal forms, such as avyaya, samasa, krdanta, taddhita and stripratyayanta pada. The determination of consonant-ending and vowel-ending padas is also described for all 8 vibhaktis. The Verbal Inflectional Morphology chapter covers the arrangement of verbs, the lakaras (tenses and moods), terminations, terminations in parasmaipada in all 10 classes (gana), and terminations in atmanepada in all 10 classes (gana).
The chapter on Panini's karaka system describes the definition of karaka, the types of karaka, and the formulation of each karaka rule. The vartikas of Katyayana are also given against the relevant sutras. The chapter on Panini's vibhakti system describes the formulation of each vibhakti sutra along with the vartikas of Katyayana. A table at the end of the chapter maps the maximum set of vibhaktis possible in each karaka, and a further table lists the vibhakti sutras that relate to Vedic texts. Panini describes the karmapravacaniya samjna (which is also an adhikara) in the fourth part of the first chapter of the Astadhyayi (from Ast. 1.4.82 to Ast. 1.4.97). Three vibhaktis (dvitiya, pancami and saptami) are used in the context of karmapravacaniya. Panini assigns the karmapravacaniya samjna to 11 nipatas within the above-mentioned adhikara. The formulation of the karmapravacaniya-related rules is described in the next chapter, titled Panini's karmapravacaniya system.
The chapter on Sanskrit language processing covers an introduction to natural language processing, the levels of analysis of a language, and the process of language processing. Tokenization, part-of-speech tagging and parsing are described under the process of language processing because they are required for karaka analysis. The lexical resources required for implementing the karaka system are set out methodically in the last section of this chapter. A selective bibliography is included in the last section of the book.
I am greatly indebted to Prof. Girish Nath Jha, under whose guidance and supervision I completed my research in the emerging multidisciplinary area called 'Computational Linguistics'. Dr. Jha provided me with continuous help and encouragement. I am deeply grateful to my teacher Dr. Hari Ram Mishra for initiating me into the discipline of Sanskrit (Grammar, Literature, Linguistics and Philosophy) and for helping me whenever there was a need.
I extend my gratitude for the financial support I received during my research from Mahatma Gandhi Antarrashtriya Hindi Vishwavidyalaya, the University Grants Commission (UGC) and Microsoft India Pvt. Ltd. (MSI), which sponsored the projects Hindi Sangrah, Online Multilingual Amarakosa and Devanagari handwriting recognition for tablet PCs, respectively, on which I worked. The Rashtriya Sanskrit Sansthan, New Delhi, provided me with a research scholarship. Special thanks go to Jawaharlal Nehru University for providing the opportunity and facilities for the doctoral research work.
Sanskrit is the primary culture-bearing language of India, with a continuous production of literature in all fields of human endeavor over the course of four millennia. Preceded by a strong oral tradition of knowledge transmission, records of written Sanskrit remain in the form of inscriptions dating back to the first century B.C.E. Extant manuscripts in Sanskrit number over 30 million - one hundred times those in Greek and Latin combined - constituting the largest cultural heritage that any civilization has produced prior to the invention of the printing press. Sanskrit works include extensive epics, subtle and intricate philosophical, mathematical, and scientific treatises, and imaginative and rich literary, poetic, and dramatic texts. The primary language of the Vedic civilization, Sanskrit developed constrained by a strong grammatical tradition stemming from the fairly complete grammar composed by Panini by the fourth century B.C.E. In addition to serving as an object of study in academic institutions, the Sanskrit language persists in the recitation of hymns in daily worship and ceremonies, as the medium of instruction in centers of traditional learning, as the medium of communication in selected academic and literary journals, academic fora, and broadcasts, and as the primary language of a revivalist community near Bangalore. The language is one of the twenty-two official languages of India, and nearly fifty thousand speakers claimed fluency in it in the 1991 Indian census (Pawan Goyal et al.).
Panini, the grammarian, marks a great divide in the long history of thinking about language, which, for convenience, can be divided into four phases of development after the pre-Paninian period:
Natural Language Processing (NLP) refers to approaches that attempt to make computers analyze, understand and generate natural languages, enabling one to address a computer in the same manner as one addresses a human being. Natural Language Processing is both a modern computational technology and a method of investigating and evaluating claims about human language itself. Panini has become very popular in contemporary linguistics, computational linguistics and artificial intelligence.
The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence", which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably - on the basis of the conversational content alone - between the program and a real human. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.
Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?".
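The kind of generic response quoted above can be illustrated with a very small pattern-and-template sketch. This is not Weizenbaum's original script: the rule table, the fallback sentence and the function name respond are all assumptions made for illustration only.

```python
import re

# Hypothetical rule table (pattern, response template); not ELIZA's real script.
RULES = [
    (re.compile(r"^my (.+)$", re.IGNORECASE), "Why do you say your {0}?"),
]

# Hypothetical fallback used when no rule matches.
FALLBACK = "Please tell me more."


def respond(utterance: str) -> str:
    """Return a canned reply by reflecting part of the input back."""
    # Drop surrounding whitespace and trailing sentence punctuation.
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            # Reuse the user's own words inside the template.
            return template.format(match.group(1))
    return FALLBACK


print(respond("My head hurts"))
```

The sketch uses no knowledge about heads or pain at all; it only reuses the words of the input, which is the point the passage makes about ELIZA.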
During the 1970s many programmers began to write 'conceptual ontologies', which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert, 1981). During this time, many chatterbots were written, including PARRY, Racter, and Jabberwacky.
Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. This was due both to the steady increase in computational power resulting from Moore's Law and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing. Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.
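The contrast between hard if-then rules and soft, weighted decisions can be made concrete with a small sketch. It assumes a toy two-tag problem; the feature names, the weights in WEIGHTS and the function tag_probabilities are hypothetical values chosen for illustration, not learned from any corpus.

```python
import math

# Hypothetical real-valued weights per tag and feature (illustrative only).
WEIGHTS = {
    "NOUN": {"left_tag=DET": 2.0, "suffix=s": 0.5},
    "VERB": {"left_tag=DET": -1.0, "suffix=s": 1.0},
}


def tag_probabilities(active_features):
    """Turn weighted feature scores into a probability for each tag."""
    # Score of a tag = sum of the weights of the features that are present.
    scores = {
        tag: sum(weights.get(feature, 0.0) for feature in active_features)
        for tag, weights in WEIGHTS.items()
    }
    # Softmax, shifted by the maximum score for numerical stability.
    highest = max(scores.values())
    exps = {tag: math.exp(score - highest) for tag, score in scores.items()}
    total = sum(exps.values())
    return {tag: value / total for tag, value in exps.items()}


probs = tag_probabilities(["left_tag=DET", "suffix=s"])
print(max(probs, key=probs.get), probs)
```

An unfamiliar feature simply contributes a weight of zero instead of causing a rule to fail, which is one way to read the robustness claim made above.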
Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data.
Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms are able to learn from data that has not been hand-annotated with the desired answers, or using a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.
As described above, modern approaches to NLP are grounded in machine learning. The paradigm of machine learning is different from that of most prior attempts at language processing. Prior implementations of language-processing tasks typically involved the direct hand coding of large sets of rules. The machine-learning paradigm calls instead for using general learning algorithms - often, although not always, grounded in statistical inference - to automatically learn such rules through the analysis of large corpora of typical real-world examples. A corpus (plural, "corpora") is a set of documents (or sometimes, individual sentences) that have been hand-annotated with the correct values to be learned.
Many different classes of machine learning algorithms have been applied to NLP tasks. In common to all of these algorithms is that they take as input a large set of "features" that are generated from the input data. As an example, for a part-of-speech tagger, typical features might be the identity of the word being processed, the identity of the words immediately to the left and right, the part-of-speech tag of the word to the left, and whether the word being considered or its immediate neighbors are content words or function words. The algorithms differ, however, in the nature of the rules generated. Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of hand-written rules that were then common.
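The features listed for a part-of-speech tagger can be sketched as a small extraction function. The function name extract_features, the boundary markers and the tiny FUNCTION_WORDS set are assumptions made for illustration; a real tagger would use a much larger word list and many more features.

```python
# Hypothetical, deliberately tiny list of function words (illustrative only).
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is"}


def extract_features(words, index, previous_tag):
    """Build the feature dictionary for the word at position `index`."""
    lowered = [word.lower() for word in words]
    word = lowered[index]
    # Use boundary markers when there is no neighbor on that side.
    left = lowered[index - 1] if index > 0 else "<START>"
    right = lowered[index + 1] if index + 1 < len(lowered) else "<END>"
    return {
        "word": word,                      # identity of the current word
        "left_word": left,                 # word immediately to the left
        "right_word": right,               # word immediately to the right
        "left_tag": previous_tag,          # tag already assigned to the left word
        "is_function_word": word in FUNCTION_WORDS,
        "left_is_function_word": left in FUNCTION_WORDS,
        "right_is_function_word": right in FUNCTION_WORDS,
    }


print(extract_features(["The", "dog", "barks"], 1, "DET"))
```

A learning algorithm would receive one such dictionary per word, together with the hand-annotated tag, and the algorithms then differ in what kind of rule or model they build from these inputs.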