IJCNLP 2011 Tutorials
The following are the details of the tutorials that have been accepted for IJCNLP 2011:
T1: Issues and Strategies for Interoperability among NLP Software Systems and Resources
Nancy Ide, Vassar College, USA and James Pustejovsky, Brandeis University, USA
November 8, 2011 (8:30-12:00)
Tutorial 1 Presentation (PPT)

T2: Using Linguist's Assistant for Language Description and Translation
Stephen Beale, University of Maryland, Baltimore County, USA
November 8, 2011 (8:30-17:30)
Tutorial 2 Presentation (ZIP)
Note: The ZIP archive contains 4 files:
LA-tutorial.pdf: a general intro page for the participants, with the schedule.
beale-demo-ijcnlp.pdf: a short paper summarizing LA's approach.
Spenglatin initial elicitation information.pdf: used during the first half of the demo.
Grammar Intro Overview.pdf: a 20-page document containing the elicitation corpus used in the initial language description process.

T3: Modality and Negation in Natural Language Processing
Roser Morante, CLiPS, University of Antwerp, Belgium
November 8, 2011 (8:30-17:30)
Tutorial 3 Presentation (PDF)

T4: Multidimensional Dialogue Act Annotation Using ISO 24617-2
Harry Bunt, Tilburg University, The Netherlands
November 8, 2011 (14:00-17:30)
Tutorial 4 Presentation (PPT)
Tutorial 4 Paper (PDF)
T1: Issues and Strategies for Interoperability among NLP Software Systems and Resources
Nancy Ide (Vassar College, USA)
James Pustejovsky (Brandeis University, USA)
Advances in the field of NLP research have been hampered by the lack of interoperability among tools and data formats. This has become especially evident over the past 15 years as the use of large-scale annotated corpora has become central to empirical methods for developing language models to drive NLP applications. In order to use much of the annotated data that has recently become available for NLP research, researchers have expended considerable time and effort to massage this data into formats that are compatible with their own or third-party software. Worse, this work is often repeated by each lab or team that uses the data in order to accommodate particular requirements, resulting in an unnecessary cycle of wasted time and effort. Recognizing this, the NLP community has recently turned its attention to devising means to achieve interoperability among data formats and tools.
This tutorial is intended to provide a survey of the issues involved in achieving interoperability among NLP resources and tools and to introduce some of the strategies and standards that represent the state of the art for achieving interoperability. The tutorial will begin by defining interoperability, a term that is widely used but rarely defined in precise and usable terms, and consider syntactic interoperability and semantic interoperability. Syntactic interoperability relies on specified data formats, communication protocols, and the like to ensure communication and data exchange, whereas semantic interoperability exists when two systems can automatically interpret exchanged information meaningfully and accurately, producing useful results by reference to a common information exchange reference model. For language resources, we can define syntactic interoperability as the ability of different systems to process (read) exchanged data either directly or via trivial conversion. Semantic interoperability for language resources can be defined as the ability of systems to interpret exchanged linguistic information in meaningful and consistent ways.
With these definitions as a basis, the tutorial will go on to consider the various levels at which interoperability can be addressed. At the most basic level, common representation formats (or means to trivially render the data in a common format) must be developed, together with some means to ensure semantic consistency among the information to be processed by different systems. Higher-level concerns involve strategies for the delivery and use of NLP applications such as pipelines of web services (e.g., the model of the Language Grid), where protocols for communication among services must also be developed.
We will next provide an introduction to the ISO Linguistic Annotation Framework (LAF) and its Graph Annotation Format (GrAF), which represent a state-of-the-art approach to achieving syntactic interoperability for language resources. The considerations that led to its design and implementation will be discussed. A full-scale implementation in a major corpus, together with a demonstration of tools that transduce GrAF to other widely-used formats, will be presented. We will also demonstrate the seamless use of resources rendered in GrAF within and across major annotation applications, in particular, the General Architecture for Text Engineering (GATE) and the Unstructured Information Management Architecture (UIMA).
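To give a feel for the standoff style of annotation that GrAF supports, the fragment below sketches the general shape of a GrAF document: nodes link to regions of the base text, annotations attach feature structures to nodes, and edges connect nodes. The namespace, element names, ids, and feature values here follow publicly available GrAF examples but are assumptions for illustration, not excerpts from the tutorial materials.

```xml
<!-- Illustrative GrAF-style standoff annotation (ids and values invented). -->
<graph xmlns="http://www.xces.org/ns/GrAF/1.0/">
  <node xml:id="n1">
    <link targets="seg-r1"/>   <!-- region of the base text, e.g. "Felix" -->
  </node>
  <node xml:id="n2">
    <link targets="seg-r2"/>   <!-- region of the base text, e.g. "sleeps" -->
  </node>
  <edge xml:id="e1" from="n2" to="n1"/>  <!-- e.g., a dependency arc -->
  <a label="tok" ref="n1" as="xces">
    <fs><f name="pos" value="NNP"/></fs> <!-- feature structure on node n1 -->
  </a>
  <a label="tok" ref="n2" as="xces">
    <fs><f name="pos" value="VBZ"/></fs>
  </a>
</graph>
```

Because annotations reference nodes rather than being interleaved with the text, multiple annotation layers from different tools can coexist over the same base document, which is what makes the format a vehicle for syntactic interoperability.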
The final segment of the tutorial will deal with the most difficult aspect of interoperability for NLP: semantic interoperability. We will discuss the issues and concerns for semantic interoperability and present an overview of a strategy for defining an interoperable annotation scheme. We will provide examples, including schemes developed from scratch as well as transduction of existing schemes. We will also discuss and demonstrate the use of a common set of reference categories, as implemented in the ISOCat data category registry.
Biographical information of presenters
Nancy Ide, Professor
Nancy Ide is Professor of Computer Science at Vassar College, where she teaches courses in Language Theory and Automata, Theory of Computation, Compiler Design, and Computational Linguistics. She has been an active researcher in the field of computational linguistics for over 25 years and has published copiously on topics including computational lexicography, word sense disambiguation, semantic web implementations of linguistically annotated data, and standards for representing language resources and inter-linked layers of linguistic annotations. In 1987, she founded the Text Encoding Initiative, which continues to be the major XML format for representing heavily annotated humanities data and which in its first instantiation in SGML served as a foundation for the later development of XML. In the early and mid-1990s she was project leader for two major European projects involving the creation and annotation of large-scale, multi-lingual corpora and the design and implementation of pipeline architectures for language processing tools. In this context she developed the XML Corpus Encoding Standard (XCES), which is still a standard in the field. She has been Principal Investigator on several National Science Foundation-funded projects, including most recently a major effort to create a massive linguistically-annotated corpus of American English, the American National Corpus, and the Manually Annotated Sub-Corpus (MASC), which is the first large corpus covering diverse genres that includes annotations for a variety of linguistic phenomena. Dr. Ide serves on several sub-committees and working groups in the International Standards Organization (ISO) Technical Committee on Language Resource Management, and is the working group convener and principal architect of the ISO Linguistic Annotation Framework.
She is currently Principal Investigator of SILT (Sustainable Interoperability for Language Technology), a major project funded by the US National Science Foundation to address interoperability for NLP data and tools. Dr. Ide is the Editor of the Springer journal Language Resources and Evaluation, one of the premier journals in the field of computational linguistics. She is also co-editor of the Springer book series entitled Text, Speech, and Language Technology, which contains over 30 books on topics covering the full range of computational linguistics research. She is the co-founder and President of the Association for Computational Linguistics special interest group on Annotation (ACL-SIGANN). She also serves as advisor and evaluator for several European agencies and projects. She is a strong advocate for open resources for NLP.
James Pustejovsky, Professor
James Pustejovsky has been involved in consensus-building activities for the representation of and access to linguistic annotations for the past several years. In 2003, he led the development of TimeML, a specification language for events and temporal expressions in natural language developed in the context of three AQUAINT workshops and projects. He also oversaw development of the TimeBank corpus, an illustration and proof of concept of the TimeML specifications, and a graphical annotation tool (TANGO) for dense annotation. He is currently head of a working group within ISO/TC 37/SC 4 to develop a Semantic Annotation Framework, and is the author of the ISO draft specifications for time annotation (SemAF-Time) and space annotation (SemAF-Space). Professor Pustejovsky was PI of NSF-CRI-0551615, "Towards a Comprehensive Linguistic Annotation of Language." This project investigated the issues involved in merging several diverse linguistic annotation efforts into a unified representation, including PropBank, NomBank, the Discourse Treebank, TimeBank, and the University of Pittsburgh Opinion Corpus. Each of these is focused on a specific aspect of the semantic representation task: semantic role labeling, discourse relations, temporal relations, etc., and has reached a level of maturity that warrants a concerted effort to merge them into a single, unified representation, a Unified Linguistic Annotation (ULA). The project annotated a common 550K word corpus, where individual annotations are kept separate in order to make it easy to produce alternative annotations of a specific type of semantic information (e.g., word senses, or anaphora) without the need to modify the annotation at the other levels. There are several technical and theoretical issues that must be resolved to bring different annotation layers together seamlessly.
Under Professor Pustejovsky's direction, the ULA project developed the Xbank Browser, which addresses many of the technical issues involved in merging annotations, and visualizes the commonalities and differences between two or more annotation schemes in order to further study both technical and theoretical issues. Currently, Professor Pustejovsky is Co-Principal Investigator of SILT (Sustainable Interoperability for Language Technology), a major project funded by the US National Science Foundation to address interoperability for NLP data and tools. In this role he has convened several major meetings addressing fundamental issues of interoperability for NLP.
Some of the materials in this tutorial overlap with a tutorial at LREC 2010 on the Linguistic Annotation Framework, although there have been significant updates to the material to reflect the final form of the LAF standard.
T2: Using Linguist's Assistant for Language Description and Translation
Stephen Beale (University of Maryland, Baltimore County)
The Linguist's Assistant (LA) is a practical computational paradigm for efficiently and thoroughly describing languages. LA contains a semantic-based elicitation corpus that is the starting point and organizing principle from which a linguist describes the surface forms of a language using a visual lexicon and grammatical rule development interface. The resulting computational description can then be used in our document authoring and translation applications.
LA has been used to produce extensive grammars and lexicons for English, Korean, Jula (a Niger-Congo language), Kewa (Papua New Guinea) and North Tanna (Vanuatu). The resulting computational resources have been used to produce high-quality translations in each of these languages. For documentation and demonstration videos of LA visit: http://ilit.umbc.edu/sbeale/LA/
This IJCNLP tutorial will introduce the LA language documentation paradigm to the participants. We will briefly discuss the goals of LA and the methodologies employed to achieve those goals. The bulk of the class time will consist of a practical tutorial in the use of LA. We will briefly discuss setting up LA for Unicode-based orthographies. Class participants will then follow a common example to set up and populate a lexicon and develop phrase structure and surface form creation rules. We will then go over examples of more complex deep structure adjustment rules. The instructor will demonstrate how to add naturally occurring texts to the initial elicitation corpus. Finally, we will use the computational language descriptions produced by LA to translate a short medical text.
Biographical information of presenters
Stephen Beale is an Honors College Fellow at the University of Maryland. Dr. Beale received his PhD in Computer Science and Language Technologies from Carnegie Mellon University and has been involved in NLP research for twenty years. He also has considerable experience in practical linguistic fieldwork, having worked with the Summer Institute of Linguistics in Thailand and Vanuatu. He is co-developer of LA and has used LA to describe several languages and translate community development and Old Testament texts.
T3: Modality and Negation in Natural Language Processing
Roser Morante (Computational Linguistics and Psycholinguistics research center (CLiPS), University of Antwerp, Belgium)
Modality and negation are ubiquitous phenomena in language. Generally speaking, modality is a grammatical category that allows speakers to express aspects of their attitude towards their statements in terms of degree of certainty, reliability, and subjectivity. In this tutorial modality is understood in a broad sense, which involves related concepts like subjectivity, hedging, evidentiality, uncertainty, committed belief, and factuality. Negation is a grammatical category that allows the truth value of a proposition to be changed. Modality and negation are treated together because they are interrelated phenomena and are prototypically expressed by linguistic devices that share some formal characteristics. For example, modality and negation cues function as operators that scope over certain parts of the sentence.
From a natural language processing perspective, a very relevant aspect of modality and negation is that they encode extra-propositional aspects of meaning. While traditionally most research has focused on propositional aspects of meaning, the interest in processing extra-propositional aspects has grown in recent years, as a natural consequence of the consolidation of areas that focus on the computational treatment of propositional aspects. Given a sentence, researchers aim at going beyond determining 'who/what does what to whom/what where and when', which would be the goal of a typical semantic role labeling or event extraction task, and are also interested in finding features such as the source, certainty level, epistemological type, truth value, and subjective aspects of the statements contained in a text. Additionally, researchers are interested in analysing discourse-level phenomena such as finding contradictions and textual entailments or modelling how the status of events changes throughout a text. Modality and negation play a central role in these phenomena.
That there is growing interest in these topics among the NLP community is reflected by a number of recent publications, the organization of the workshop 'Negation and Speculation in Natural Language Processing (NeSp-NLP 2010)', the popularity of the CoNLL 2010 shared task on 'Learning to detect hedges and their scope in natural language text', and the forthcoming publication of a special issue of the journal Computational Linguistics. Research on modality and negation has also been stimulated by the release of a number of data sets annotated with various types of information related to these phenomena.
This tutorial is divided in five modules. In Module 1, I will introduce modality and negation as devices that express extra-propositional aspects of meaning, I will define related concepts and I will show why it is interesting and complex to process them. In Module 2, I will present different categorisation schemes and annotation efforts, as well as an overview of existing resources. In Module 3, I will describe how several related tasks have been modelled and solved. I will present in detail the rule-based and machine learning approaches that have been used to solve the tasks. In Module 4, I will focus on applications that have incorporated the treatment of modality and negation, and on research that analyses the impact of processing these phenomena. The applications range from sentiment analysis to biomedical text mining. Finally, in Module 5, I will summarize achievements and point out open problems.
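The contrast in Module 3 between rule-based and machine learning approaches to scope finding can be made concrete with a toy baseline. The sketch below is purely illustrative and is not a system presented in the tutorial: the cue list is a small assumed sample, and the "scope runs to the end of the sentence" rule is a deliberate simplification of the kind that real systems improve on with syntactic features and learned classifiers.

```python
# Illustrative only: a toy rule-based detector for negation cues and their
# scope. Real systems (e.g., CoNLL 2010 shared task submissions) use richer
# cue lexicons, syntactic features, and machine-learned scope classifiers.
NEGATION_CUES = {"not", "no", "never", "n't", "without", "neither", "nor"}

def detect_scope(tokens):
    """Return (cue_index, scope_indices).

    Baseline rule: the scope runs from the token after the first cue to
    the end of the sentence. Returns (None, []) if no cue is found.
    """
    for i, tok in enumerate(tokens):
        if tok.lower() in NEGATION_CUES:
            return i, list(range(i + 1, len(tokens)))
    return None, []

cue, scope = detect_scope("The drug did not reduce mortality .".split())
```

Framing the same task as machine learning classification, as described in the tutorial, means replacing the fixed rule with a per-token classifier that predicts whether each token is inside or outside the scope of a given cue.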
The tutorial does not assume attendees know anything about modality and negation, but assumes that attendees have some familiarity with natural language processing.
Biographical information of presenters
Dr. Roser Morante is a senior researcher at CLiPS, a research center associated with the Linguistics Department of the Faculty of Arts at the University of Antwerp, Belgium. She obtained her PhD in Computational Linguistics at the University of Tilburg, The Netherlands, where she also worked as a postdoctoral researcher. She is currently working on the Biograph project led by Professor Walter Daelemans, where she applies text mining techniques to extract biomedical relations from scientific texts. In the project she has worked extensively on both modality and negation. She proposed the first model of the scope finding task as a machine learning classification task and has developed systems for finding the scope of negation and hedge cues. The system that her team submitted to the CoNLL Shared Task 2010 scored first in Task 2 on finding the scope of hedge cues. She has co-organized the Workshop on Negation and Speculation in Natural Language Processing (NeSp-NLP 2010) and she is currently a Guest Editor of the Special Issue on Modality and Negation for the journal Computational Linguistics. She has also been involved in the organization of the Workshop Advances on Bio Text Mining 2010, the SemEval 2010 shared task on Linking Events and Their Participants in Discourse, and the evaluation exercise Processing modality and negation for machine reading, a pilot task of the Question Answering for Machine Reading Evaluation at CLEF 2011. Her current research focus is on NLP tasks related to semantics and discourse.
Some parts of this tutorial (1.1, 1.3, 1.5, 3.1, 3.2, 3.3) are based on the material presented at the University Carlos III of Madrid in the framework of the 'Seminario Mavir: Introduction to Text Mining. Processing subjectivity in texts' with an audience of 40 attendees, although 50% of the material has been changed and updated.
T4: Multidimensional Dialogue Act Annotation Using ISO 24617-2
Harry Bunt (Tilburg Center for Cognition and Communication, Tilburg University)
This tutorial introduces the new ISO standard for dialogue act annotation, ISO 24617-2, and its application. The establishment of this standard forms part of the larger ISO effort to define annotation schemes in support of the construction of interoperable semantically annotated resources, called the Semantic Annotation Framework (SemAF). ISO 24617-2 builds on previous efforts such as DAMSL, DIT++, and LIRICS; its main features will be presented and discussed in the tutorial.
DiAML annotations have recently been shown to be machine-learnable with high accuracy, by algorithms which construct annotations incrementally as an utterance comes in, demonstrating the feasibility of time-effective annotation.
The markup language DiAML has been designed in accordance with the ISO Linguistic Annotation Framework (LAF, ISO 24612:2009), which draws a distinction between the concepts of annotation and representation. The term 'annotation' refers to the linguistic information that is added to segments of language data, independent of the format in which the information is represented; 'representation' refers to the format in which an annotation is rendered, independent of its content. This distinction is implemented in the DiAML definition by a syntax that specifies, besides a class of XML-based 'representation structures', also a class of more abstract 'annotation structures'. These two components are called the 'concrete' and 'abstract' syntax, respectively. Annotation structures are set-theoretical constructs, made up of concepts of the underlying metamodel; the concrete syntax defines a rendering of annotation structures in XML. These concepts and their use will be explained and illustrated.
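As a rough illustration of what the concrete (XML) syntax renders, the fragment below sketches a single dialogue act annotation in the DiAML style. The participant ids, functional segment id, and attribute values are invented for illustration, and the exact element and attribute names should be checked against the ISO 24617-2 specification itself.

```xml
<!-- Illustrative DiAML-style rendering of one annotation structure
     (ids and values are hypothetical). -->
<diaml xmlns="http://www.iso.org/diaml/">
  <dialogueAct xml:id="da1"
               target="#fs1"
               sender="#p1" addressee="#p2"
               communicativeFunction="propositionalQuestion"
               dimension="task"/>
</diaml>
```

The point of the two-level design is that this XML is only one possible rendering: the underlying annotation structure (sender, addressee, functional segment, dimension, communicative function) is defined set-theoretically by the abstract syntax, independent of any representation format.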
Given the increased importance for NLP research and applications of annotated corpora in general, and of semantically annotated corpora in particular, this tutorial should be relevant to anyone who takes an interest in the processing of interactive, spoken language and its semantic and pragmatic interpretation.
Biographical information of presenters
Prof.dr. Harry Bunt,
Harry Bunt has been professor of Computational Linguistics and Artificial Intelligence at Tilburg University in the Netherlands since 1983. Prior to that he worked at Philips Research on dialogue systems. He studied physics and mathematics at the University of Utrecht, and obtained a doctorate in formal semantics at the University of Amsterdam.
He co-founded the ACL special interest group on parsing, SIGPARSE, and the special interest group on computational semantics, SIGSEM. He has organized the biennial "International Conference on Parsing Technology" (IWPT) series since 1993, and initiated and has organized the biennial "International Workshop on Computational Semantics" (IWCS) series since 1994. He has organized a series of workshops on Interoperable Semantic Annotation (the ISA series) since 2003, of which the most recent one (ISA-6) was held in Oxford, UK in January 2011.
One of his main areas of interest is the semantics and pragmatics of spoken and multimodal utterances in dialogue. He developed a theoretical framework for dialogue analysis called "Dynamic Interpretation Theory" (DIT), from which the DIT++ annotation scheme has emerged.
He is project leader for the International Organisation for Standards ISO of the effort to design an international standard for dialogue act annotation, and a co-project leader of several other ISO efforts concerned with the development of standards for various aspects of semantic annotation.
His interests and publications cover the areas of formal and computational semantics and pragmatics; semantic annotation; dialogue theory; human-computer interaction; parser and grammar design; knowledge representation; and dialogue systems. He published the monograph "Mass terms and model-theoretic semantics" (Cambridge University Press, 1985), the book series "Computing Meaning, volumes 1-3" (Kluwer and Springer, 1999, 2002, 2007), two books on multimodal interaction, and four books on parsing, of which the most recent one, "Trends in Parsing Technology" (with Joakim Nivre and Paola Merlo), appeared in 2010.