NAME

Alvis::NLPPlatform::NLPWrappers - Perl extension providing the wrappers used to linguistically annotate XML documents in Alvis

SYNOPSIS

use Alvis::NLPPlatform::NLPWrappers;

Alvis::NLPPlatform::NLPWrappers::tokenize($h_config,$doc_hash);

DESCRIPTION

This module provides default wrappers for the Natural Language Processing (NLP) tools. These wrappers are called by the ALVIS NLP Platform (see Alvis::NLPPlatform).

Default wrappers can be overridden by defining new wrappers in a new, local UserNLPWrappers module.

METHODS

tokenize()

tokenize($h_config, $doc_hash);

This method carries out the tokenization of the input document. $doc_hash is the hash table containing all the annotations of the input document.

The tokenizer was written specifically for ALVIS, because tokenization depends largely on what counts as a token for our purposes. Hence, this function is not a wrapper but the tokenizing tool itself. Its input is the plain text corpus, which is segmented into tokens. A token is a group of characters belonging to the same category. There are four possible categories:

  • alphabetic characters (all letters from 'a' to 'z', including accented characters)

  • numeric characters (numbers from '0' to '9')

  • space characters (carriage return, line feed, space and tab)

  • symbols: all characters that do not fit in the previous categories

During the tokenization process, all tokens are stored in memory via a hash table (%hash_tokens).

$h_config is a reference to the hash table containing the variables defined in the configuration file.

The method returns the number of tokens.
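The grouping of characters by category can be sketched as follows. This is only an illustration, not the code of tokenize() itself: the helper names (char_category, sketch_tokenize), the category labels and the layout of the token records are assumptions, as is the choice to merge every run of same-category characters into one token.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: map one character to one of the four categories.
sub char_category {
    my ($char) = @_;
    return 'alpha' if $char =~ /\p{L}/;         # letters, accented ones included
    return 'num'   if $char =~ /[0-9]/;         # digits
    return 'sep'   if $char =~ /[ \t\r\n]/;     # space characters
    return 'symb';                              # everything else
}

# Hypothetical sketch: group consecutive characters of the same category.
# Returns the number of tokens and a reference to the token hash table.
sub sketch_tokenize {
    my ($text) = @_;
    my %hash_tokens;
    my $count  = 0;
    my $offset = 0;
    for my $char (split //, $text) {
        my $type = char_category($char);
        my $last = $count ? $hash_tokens{"token$count"} : undef;
        if ($last && $last->{type} eq $type) {
            $last->{content} .= $char;          # extend the current token
        } else {
            $count++;                           # start a new token
            $hash_tokens{"token$count"} =
                { type => $type, content => $char, from => $offset };
        }
        $offset++;
    }
    return ($count, \%hash_tokens);
}

my ($count, $tokens) = sketch_tokenize("sigK 2 genes.");
print "$count tokens\n";
```

Keying the hash by a running identifier keeps the document order recoverable, and storing the starting offset lets later steps refer back to the text.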

scan_ne()

scan_ne($h_config, $doc_hash);

This method wraps the default named entity recognition tool and tags the input document. $doc_hash is the hash table containing all the annotations of the input document. The step aims at annotating semantic units with syntactic and semantic types. Each text sequence corresponding to a named entity is tagged with a unique tag corresponding to its semantic value (for example a "gene" type for gene names, a "species" type for species names, etc.). All these text sequences are also assumed to be equivalent to nouns: the tagger dynamically produces linguistic units equivalent to words or noun phrases.

$h_config is a reference to the hash table containing the variables defined in the configuration file.

We integrated TagEN (Jean-Francois Berroyer. TagEN, un analyseur d'entites nommees : conception, developpement et evaluation. Universite Paris-Nord, France. 2004. Memoire de D.E.A. d'Intelligence Artificielle) as the default named entity tagger. It is based on a set of linguistic resources and grammars. TagEN can be downloaded here: http://www-lipn.univ-paris13.fr/~hamon/ALVIS/Tools/TagEN.tar.gz

word_segmentation()

word_segmentation($h_config, $doc_hash);

This method wraps the default word segmentation step. $doc_hash is the hash table containing all the annotations of the input document.

We use simple regular expressions, based on the algorithm proposed in G. Grefenstette and P. Tapanainen. What is a word, what is a sentence? Problems of tokenization. The 3rd International Conference on Computational Lexicography. pages 79-87. 1994. Budapest. The method is a wrapper for the awk script implementing this approach, which was posted on the Corpora list (see the archives http://torvald.aksis.uib.no/corpora/ ). The script carries out word segmentation as well as sentence segmentation. Information related to the sentence segmentation is used in the default sentence_segmentation method.

$h_config is a reference to the hash table containing the variables defined in the configuration file.

In the default wrapper, segmented words are then aligned with tokens and named entities. For example, let "Bacillus subtilis" be a named entity made of three tokens: "Bacillus", the space character and "subtilis". The word segmenter will find two words: "Bacillus" and "subtilis". The wrapper, however, creates a single word made of the same three tokens, since "Bacillus subtilis" was found to be a named entity and should thus be considered a single word.
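The alignment described above can be sketched as follows. The function name align_words_with_entities and the representation of words and named entities as lists of token identifiers are assumptions made for the illustration; the default wrapper may store this information differently.

```perl
use strict;
use warnings;

# Hypothetical sketch: merge segmented words that fall inside a named entity.
# $words and $entities are array references; each element is an array
# reference of token identifiers, given in document order.
sub align_words_with_entities {
    my ($words, $entities) = @_;

    # Map every token covered by a named entity to the index of that entity.
    my %entity_of;
    for my $i (0 .. $#$entities) {
        $entity_of{$_} = $i for @{ $entities->[$i] };
    }

    my (@aligned, %emitted);
    for my $word (@$words) {
        my $i = $entity_of{ $word->[0] };
        if (defined $i) {
            # Emit the entity once, as a single word spanning all its tokens.
            push @aligned, [ @{ $entities->[$i] } ] unless $emitted{$i}++;
        } else {
            push @aligned, [ @$word ];
        }
    }
    return \@aligned;
}

# "Bacillus subtilis grows": token1 = Bacillus, token2 = space,
# token3 = subtilis, token4 = space, token5 = grows
my $words    = [ ['token1'], ['token3'], ['token5'] ];
my $entities = [ ['token1', 'token2', 'token3'] ];
my $aligned  = align_words_with_entities($words, $entities);
print join(' | ', map { join('+', @$_) } @$aligned), "\n";
```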

sentence_segmentation()

sentence_segmentation($h_config, $doc_hash);

This method wraps the default sentence segmentation step. $doc_hash is the hash table containing all the annotations of the input document.

$h_config is a reference to the hash table containing the variables defined in the configuration file.

The sentence segmentation function does not invoke any external tool (see the word_segmentation() method for more explanation). It scans the token hash table for full stops, i.e. dots that were not considered to be part of words. Each of these full stops marks the end of a sentence. Each sentence is then assigned an identifier and two offsets: that of the starting token and that of the ending token.
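A minimal sketch of this scan is given below. The function name sketch_sentences, the token record layout and the %in_word lookup are assumptions for the illustration; the sketch also assumes that leading space tokens are skipped and that trailing text without a full stop forms a last sentence.

```perl
use strict;
use warnings;

# Hypothetical sketch: every dot that is not part of a word ends a sentence.
# $tokens is an array reference of { id, content } records in document order;
# $in_word is a hash reference flagging the tokens that belong to a word.
sub sketch_sentences {
    my ($tokens, $in_word) = @_;
    my @sentences;
    my ($start, $last);
    for my $token (@$tokens) {
        # Do not start a sentence on a space token (assumption).
        next if !defined $start && $token->{content} =~ /^\s+$/;
        $start = $token->{id} unless defined $start;
        $last  = $token->{id};
        if ($token->{content} eq '.' && !$in_word->{ $token->{id} }) {
            push @sentences, { id   => 'sentence' . (@sentences + 1),
                               from => $start, to => $last };
            undef $start;
        }
    }
    # Trailing text without a final full stop becomes a last sentence.
    push @sentences, { id   => 'sentence' . (@sentences + 1),
                       from => $start, to => $last } if defined $start;
    return \@sentences;
}

# "B. subtilis grows. It sporulates." where "B." was segmented as one word
my @contents = ('B', '.', ' ', 'subtilis', ' ', 'grows', '.',
                ' ', 'It', ' ', 'sporulates', '.');
my $n = 0;
my @tokens  = map { +{ id => 'token' . ++$n, content => $_ } } @contents;
my %in_word = (token2 => 1);    # the dot of the abbreviation "B."

for my $sentence (@{ sketch_sentences(\@tokens, \%in_word) }) {
    print "$sentence->{id}: $sentence->{from} to $sentence->{to}\n";
}
```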

pos_tag()

pos_tag($h_config, $doc_hash);

The method wraps the Part-of-Speech (POS) tagging step. $doc_hash is the hash table containing all the annotations of the input document. It works as follows: every word is passed to the external POS tagging tool. For every input word, the tagger outputs its tag. The wrapper then creates a hash table that associates each tag with its word. It assumes that word and sentence segmentation have been performed.

$h_config is a reference to the hash table containing the variables defined in the configuration file.

By default, we use the probabilistic Part-of-Speech tagger TreeTagger (Helmut Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. New Methods in Language Processing, Studies in Computational Linguistics. 1997. Daniel Jones and Harold Somers. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ ).

As this POS tagger also carries out lemmatization, the method also adds annotations at the lemma level.

The GeniaTagger (Yoshimasa Tsuruoka and Yuka Tateishi and Jin-Dong Kim and Tomoko Ohta and John McNaught and Sophia Ananiadou and Jun'ichi Tsujii. Developing a Robust Part-of-Speech Tagger for Biomedical Text. Proceedings of Advances in Informatics - 10th Panhellenic Conference on Informatics. pages 382-392. 2005. LNCS 3746.) can also be used, by modifying the column order (see the definition of the command line in client.pl).
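The role of the column order can be illustrated with the sketch below, which reads tab-separated tagger output into a hash table. The helper name parse_tagger_output and the %layout table are hypothetical, and the column positions are assumptions about each tagger's output (form, tag, lemma for TreeTagger; form, base form, tag for GeniaTagger) that should be checked against the command line actually used.

```perl
use strict;
use warnings;

# Assumed column positions in each tagger's tab-separated output.
my %layout = (
    treetagger  => { form => 0, tag   => 1, lemma => 2 },
    geniatagger => { form => 0, lemma => 1, tag   => 2 },
);

# Hypothetical helper: turn raw tagger output into a hash keyed by word id.
sub parse_tagger_output {
    my ($output, $columns) = @_;
    my %annotations;
    my $word_id = 0;
    for my $line (split /\n/, $output) {
        next if $line =~ /^\s*$/;               # skip empty lines
        my @fields = split /\t/, $line;
        $word_id++;
        $annotations{"word$word_id"} = {
            form  => $fields[ $columns->{form} ],
            tag   => $fields[ $columns->{tag} ],
            lemma => $fields[ $columns->{lemma} ],
        };
    }
    return \%annotations;
}

my $output = "genes\tNNS\tgene\nare\tVBP\tbe\n";
my $tags   = parse_tagger_output($output, $layout{treetagger});
for my $id (sort keys %$tags) {
    printf "%s: %s (%s)\n", $id, $tags->{$id}{tag}, $tags->{$id}{lemma};
}
```

Keeping the column positions in a table means that switching taggers only requires choosing another layout, not another parser.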

lemmatization()

lemmatization($h_config, $doc_hash);

This method wraps the default lemmatizer. $doc_hash is the hash table containing all the annotations of the input document. However, as the POS tagger TreeTagger also provides lemmas, this method does nothing. It is present only for conformance.

$h_config is a reference to the hash table containing the variables defined in the configuration file.

term_tag()

term_tag($h_config, $doc_hash);

The method wraps the term tagging step of the ALVIS NLP Platform. $doc_hash is the hash table containing all the annotations of the input document. This step aims at recognizing terms in the documents, as distinct from named entities (see Alvis::TermTagger), such as gene expression or spore coat cell. Term lists can be provided as terminological resources such as the Gene Ontology (http://www.geneontology.org/ ), the MeSH (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=mesh ) or, more widely, UMLS (http://umlsinfo.nlm.nih.gov/ ). They can also be acquired through corpus analysis.

Term matching in the document takes typographical and inflectional variations into account. The typographical variation requires a slight preprocessing of the terms.

We first assume a less strict use of the dash character. For instance, the term UDP-glucose can appear in the documents as UDP glucose and vice versa. The inflectional variation requires a lemmatization of the input documents. It makes it possible to identify transcription factors from transcription factor. Both variation types can be taken into account together or separately during term matching. Previous annotation levels, such as lemmatization and word segmentation but also named entities, are required.
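The two variation types can be sketched as follows. This is not the Alvis::TermTagger implementation: the function names (normalise_dashes, match_terms), the option names and the word records with form and lemma fields are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Typographical variation: treat dashes and spaces alike.
sub normalise_dashes {
    my ($string) = @_;
    $string =~ s/-/ /g;
    $string =~ s/\s+/ /g;
    return $string;
}

# Hypothetical sketch: return the terms found in a sequence of words.
# Each word is a { form, lemma } record; the two variation types can be
# switched on together or separately.
sub match_terms {
    my ($terms, $words, %variation) = @_;

    # Inflectional variation: match against lemmas instead of word forms.
    my $level = $variation{inflectional} ? 'lemma' : 'form';
    my $text  = join ' ', map { $_->{$level} } @$words;
    $text = normalise_dashes($text) if $variation{typographical};

    my @found;
    for my $term (@$terms) {
        my $pattern = $variation{typographical} ? normalise_dashes($term) : $term;
        # The term must be delimited by spaces or by the text boundaries.
        push @found, $term if $text =~ /(?<!\S)\Q$pattern\E(?!\S)/;
    }
    return \@found;
}

my @words = (
    { form => 'UDP',           lemma => 'UDP' },
    { form => 'glucose',       lemma => 'glucose' },
    { form => 'activates',     lemma => 'activate' },
    { form => 'transcription', lemma => 'transcription' },
    { form => 'factors',       lemma => 'factor' },
);
my @terms = ('UDP-glucose', 'transcription factor');
my $found = match_terms(\@terms, \@words, typographical => 1, inflectional => 1);
print join(', ', @$found), "\n";
```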

$h_config is a reference to the hash table containing the variables defined in the configuration file.

Canonical forms and semantic tags provided with the term list are taken into account. Canonical forms are associated with the terms. Semantic tags are added at the semantic features level. A semantic tag can be considered as a path in an ontology: each dot or slash character is treated as a separator between node identifiers.
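Reading a semantic tag as an ontology path amounts to splitting it on dots and slashes, as in the sketch below. The function name semantic_path and the example tag are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split a semantic tag into ontology node identifiers.
# Dots and slashes are both treated as separators; empty parts are dropped.
sub semantic_path {
    my ($tag) = @_;
    return [ grep { length } split m{[./]}, $tag ];
}

my $path = semantic_path('organism/bacteria.bacillus');
print join(' > ', @$path), "\n";
```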

syntactic_parsing()

syntactic_parsing($h_config, $doc_hash);

This method wraps the default sentence parsing step. It aims at producing the graph of the syntactic dependency relations between the words of the sentence. $doc_hash is the hash table containing all the annotations of the input document.

$h_config is a reference to the hash table containing the variables defined in the configuration file.

The Link Grammar Parser (Daniel D. Sleator and Davy Temperley. Parsing English with a link grammar. Third International Workshop on Parsing Technologies. 1993. http://www.link.cs.cmu.edu/link/ ) is currently integrated.

Processing time is critical for syntactic parsing, but we expect that good term recognition can significantly reduce the number of possible parses and consequently the parsing time. Term identification is therefore performed prior to parsing. The word level of annotation is required. Depending on the choice of parser, the morphosyntactic level may also be needed.

semantic_feature_tagging()

semantic_feature_tagging($h_config, $doc_hash)

The semantic typing function attaches a semantic type to the words, terms and named entities (referred to as lexical items in the following) in documents according to the conceptual hierarchies of the ontology of the domain. $doc_hash is the hash table containing all the annotations of the input document.

$h_config is a reference to the hash table containing the variables defined in the configuration file.

Currently, this step is not integrated in the platform.

semantic_relation_tagging()

semantic_relation_tagging($h_config, $doc_hash)

This method wraps the semantic relation identification step. $doc_hash is the hash table containing all the annotations of the input document. In the Alvis project, the default behaviour is the identification of domain-specific semantic relations, i.e. relations occurring between instances of the ontological concepts in the document. These instances are identified and tagged accordingly by the semantic typing. As a result, these semantic relation annotations give another level of semantic representation of the document, which makes explicit the role that these semantic units (usually named entities and/or terms) play with respect to each other, with regard to the ontology of the domain. However, this annotation depends on previous document annotations. There are two tagging strategies, corresponding to the two processing lines (annotation of web documents, and acquisition of the resources used in the web document annotation process), that affect the implementation of the semantic relation tagging:

  • If the document is syntactically parsed, the method can exploit this information to tag relations mentioned explicitly. This is achieved by pattern matching of information extraction rules, which are applied by a rule matcher. The semantic relation tagger is therefore a mere wrapper for the inference method.

  • If the document is not syntactically parsed, the method bases its tagging on relations given by the ontology: all known relations holding between semantic units described in the document are added, whether or not those relations are explicitly mentioned in the document.
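The second strategy can be sketched as follows. Since this step is not integrated in the platform, everything here is hypothetical: the function name relations_from_ontology, the representation of the ontology as a list of typed relations, and the reading of "known relations" as relations declared between semantic types.

```perl
use strict;
use warnings;

# Hypothetical sketch: add every ontology relation whose source and target
# types are both instantiated by semantic units of the document.
# $ontology is an array reference of { name, from, to } records (types);
# $units is an array reference of { id, type } records (tagged units).
sub relations_from_ontology {
    my ($ontology, $units) = @_;
    my @relations;
    for my $relation (@$ontology) {
        my @sources = grep { $_->{type} eq $relation->{from} } @$units;
        my @targets = grep { $_->{type} eq $relation->{to} }   @$units;
        for my $source (@sources) {
            for my $target (@targets) {
                next if $source->{id} eq $target->{id};   # no self-relation
                push @relations, { type => $relation->{name},
                                   from => $source->{id},
                                   to   => $target->{id} };
            }
        }
    }
    return \@relations;
}

my $ontology = [ { name => 'regulates', from => 'protein', to => 'gene' } ];
my $units    = [
    { id => 'term1', type => 'protein' },
    { id => 'ne1',   type => 'gene' },
    { id => 'ne2',   type => 'species' },
];
for my $relation (@{ relations_from_ontology($ontology, $units) }) {
    print "$relation->{type}($relation->{from}, $relation->{to})\n";
}
```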

$h_config is a reference to the hash table containing the variables defined in the configuration file.

Currently, this step is not integrated in the platform.

anaphora_resolution()

anaphora_resolution($h_config, $doc_hash)

The method wraps the tool that identifies and resolves the anaphora present in a document. $doc_hash is the hash table containing all the annotations of the input document. We restrict the resolution to anaphora involving the pronoun it. The anaphora resolution takes as input an annotated document in the ALVIS format, coming from the semantic type tagging, and produces an augmented text, also in the ALVIS format, with XML tags corresponding to anaphora relations between antecedents and pronouns.

$h_config is a reference to the hash table containing the variables defined in the configuration file.

Currently, this step is not integrated in the platform.


SEE ALSO

Alvis web site: http://www.alvis.info

AUTHORS

Thierry Hamon <thierry.hamon@lipn.univ-paris13.fr> and Julien Deriviere <julien.deriviere@lipn.univ-paris13.fr>

LICENSE

Copyright (C) 2005 by Thierry Hamon and Julien Deriviere

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.