README - metacpan.org


            
              1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
              UMLS::Similarity
  SYNOPSIS
      This package consists of Perl modules along with supporting Perl
      programs that implement the semantic similarity and relatedness 
      measures described by Leacock & Chodorow (1998), Wu & Palmer (1994), 
      Nguyen and Al-Mubaid  (2006), Zhong, et al. (2002), Rada, et. al. 
      (1989), Jiang & Conrath (1997), Resnik (1995), Lin (1998), Banerjee 
      and Pedersen(2002), Patwardhan and Pedersen (2006) and a simple path  
      based measure.
      UMLS::Similarity requires the UMLS::Interface module to access 
      the Unified Medical Language System (UMLS) in order to determine 
      the similarity between two UMLS concepts.
      The Perl modules are designed as objects with methods that take as
      input two concepts from the UMLS. The semantic relatedness of these 
      concepts is returned by these methods. A quantitative measure of 
      the degree to which the two concepts are related has wide ranging 
      applications in numerous areas, such as word sense disambiguation, 
      information retrieval, etc. For example, in order to determine which 
      sense of a given word is being used in a particular context, the sense 
      having the highest relatedness with its context word senses is most 
      likely to be the sense being used. Similarly, in information retrieval, 
      retrieving documents containing highly related concepts are more likely 
      to have higher precision and recall values.
      The following sections describe the organization of this software
      package and how to use it. A few typical examples are given to help
      clearly understand the usage of the modules and the supporting
      utilities.
  INSTALL
        To install these modules run:
          perl Makefile.PL
          make
          make test
          make install
        This will install the modules in the standard locations. You will, 
        most probably, require root privileges to install in standard system
        directories. To install in a non-standard directory, specify a prefix
        during the 'perl Makefile.PL' stage as:
          perl Makefile.PL PREFIX=/home
        It is possible to modify other parameters during installation. The
        details of these can be found in the ExtUtils::MakeMaker
        documentation. However, it is highly recommended not messing 
        around with other parameters, unless you know what you're doing.
  SEMANTIC RELATEDNESS
        We observe that humans find it extremely easy to say if two words are
        related and if one word is more related to a given word than another.
        For example, if we come across two words -- 'car' and 'bicycle', we know
        they are related as both are means of transport. Also, we easily observe
        that 'bicycle' is more related to 'car' than 'fork' is. But is there
        some way to assign a quantitative value to this relatedness? Some ideas
        have been put forth by researchers to quantify the concept of
        relatedness of words, with encouraging results.
        A number of different measures of relatedness have been implemented in
        this software package. These include a simple edge counting
        approach. The measures require the UMLS-Interface that define UMLS 
        concepts, and some basic relationships between these concepts.
  CONTENTS
        All the modules that will be installed in the Perl system directory are
        present in the '/lib' directory tree of the package. These include the
        semantic relatedness modules -- 
          UMLS/Similarity/lch.pm
          UMLS/Similarity/path.pm
          UMLS/Similarity/wup.pm
          UMLS/Similarity/nam.pm
          UMLS/Similarity/zhong.pm
          UMLS/Similarity/cdist.pm
          UMLS/Similarity/res.pm
          UMLS/Similarity/lin.pm
          UMLS/Similarity/jcn.pm
          UMLS/Similarity/random.pm
          UMLS/Similarity/vector.pm
          UMLS/Similarity/lesk.pm
        -- present in the lib/ subdirectory. All these modules, once installed
        in the Perl system directory, can be directly used by Perl programs.
        The package contains a utils/ directory that contain Perl utility 
        programs. These utilities use the modules or provide some supporting
        functionality.
          umls-similarity.pl         -- returns the semantic similarity of two 
                                        terms or UMLS CUIs given a specified 
                                        measure (and view of the UMLS).
          spearman.pl                -- calculates the Spearman Rank 
                                        Correlation between two files
          vector-input.pl            -- creates the matrix and index files 
                                        required for the vector measure
          SignificanceTesting.r      -- R script to calculate the correlation 
                                        between a gold standard and the results 
                                        obtained using the measures in the 
                                        umls-similarity.pl program
          sim2r.pl                   -- converts umls-similarity.pl output to 
                                        a format that can be read by the R script
          create-icfrequency.pl      -- create the frequency file for 
                                        information content measures
          create-icpropagation.pl    -- create the probability file for 
                                        information content measures
   
         vector-input.pl            -- script to create the matrix and index
                                       files for the vector measure
        The package contains a web/ directory which contains a web interface to 
        the UMLS-Similarity package once it is installed. Please see the README 
        file in the web/ directory for further information.
  CONFIGURATION
    UMLS-Interface allows information to be extracted from the UMLS given a
    specified set of sources and relations through the use of a
    configuration file.
    There are six configuration options: SAB, REL, RELA, SABDEF, RELDEF, and
    RELADEF.
    The SAB and REL options are used to determine which sources and
    relations the path information is to be obtained from. The RELA option
    narrows down the relation even further. The RELA will only be applied to
    the PAR/CHD and RB/RN relations.
    The SABDEF and RELDEF options are used to determine which sources and
    relations to use when creating the Extended Definition. The RELA option
    narrows down the relation even further. The RELADEF will only be applied
    to the PAR/CHD and RB/RN relations.
    The path, wup, lch, lin, jcn and res measures require the SAB and REL
    options to be set. There is also an optional RELA option.
    The vector and lesk measures require the SABDEF and RELDEF options to be
    set with an optional RELADEF.
    You can specify a single source, multiple sources or the entire UMLS
    (using the UMLS_ALL option). Keep in mind that the greater the number of
    sources the larger the search space so if you obtaining path information
    about two concepts this will take longer. The names of the sources in
    the configuration file are expected to be in the SAB (source
    abbreviation) form. A listing of the sources and their SABs can be
    found:
    <http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/re
    lease/source_vocabularies.html>
    You can specify any relations that exist in the specified set of sources
    that you defined. The directional (hierarchical) relations though are
    PAR/CHD and RB/RN. The other relations (such as RO and SIB) are not
    directional which means when obtaining path information when using these
    relations may take much longer than obtaining path information using the
    directional relations. A listing of the different relations can be found
    here (scroll down to the REL table):
    <http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/re
    lease/abbreviations.html>
    If you do plan on using a multiple sources or the entire UMLS, we would
    advise you to use the --realtime option which is explained below, in the
    Interface.pm documentation and the path programs in the utils/
    directory. We also have a am UMLS_ALL option for this so you do not have
    to specify each and every source and relation.
    The format of the configuration file is as follows:
    SAB :: <include|exclude> <source1, source2, ... sourceN>
    REL :: <include|exclude> <relation1, relation2, ... relationN>
    RELA :: <include|exclude> <rela1, rela2, ... relaN>
    For example, if we wanted to use the MSH vocabulary with only the RB/RN
    relations, the configuration file would be:
     SAB :: include MSH
     REL :: include RB, RN
    or
     SAB :: include MSH
     REL :: exclude PAR, CHD
    If we wanted to use the SNOMEDCT vocabulary with only the PAR/CHD
    relations that are is-a relations, the configuration file would be:
     SAB :: include SNOMEDCT
     REL :: include PAR, CHD 
     RELA :: include isa, inverse_isa
    The format for SABDEF and RELDEF is similar.
    The SABDEF and RELDEF options are used to determine the sources and
    relations the extended definition is to be obtained from.
    The format of the configuration file is as follows:
    SABDEF :: <include|exclude> <source1, source2, ... sourceN>
    RELDEF :: <include|exclude> <relation1, relation2, ... relationN>
    RELADEF :: <include|exclude> <rela1, rela2, ... relaN>
    Note: RELDEF takes any of MRREL relations and two special 'relations':
          1. CUI which refers to the CUIs definition
          2. TERM which refers to the terms associated with the CUI
    For example, if we wanted to use the definitions from MSH vocabulary and
    we only wanted the definition of the CUI and the definitions of the CUIs
    SIB relation, the configuration file would be:
     SABDEF :: include MSH
     RELDEF :: include CUI, SIB
    If you wanted only the PAR/CHD definitions which are is-a relations.
     SABDEF :: include MSH
     RELDEF :: include PAR, CHD
     RELADEF :: include isa, inverse_isa
    For all of these options, there is an UMLS_ALL tag. If used with SAB or
    SABDEF, it would include all of the UMLS sources. If used with the REL
    or RELDEF, it would include all of the possible relations (as well as
    CUI and TERM for RELDEF). If used with the RELA or RELADEF, it would
    include all of the RELA relations including those with no RELA relation.
    Note that this is also the default for this option which is why it is
    optional. An example of using the UMLS_ALL option is as follows:
     SAB :: include UMLS_ALL
     REL :: include UMLS_ALL
    and another is:
     SABDEF :: include UMLS_ALL
     RELDEF :: include UMLS_ALL
    If you go to the configuration file directory, there will be example
    configuration files for the different runs that you have performed.
    For more information about the configuration options please see the
    README.
  PROPAGATION
    The Information Content (IC) is defined as the negative log of the
    probability of a concept. The probability of a concept, c, is determine
    by summing the probability of the concept occurring in some text plus
    the probability its descendants occurring in some text.
    The following is an example of the method UMLS-Interface uses to
    propagate counts to determine the probability of a concept in the
    sources/relations specified in the configuration file. In this method,
    we percolate the counts up the hierarchy, and in the case of multiple
    inheritance, we send a full count up all the paths to the parent.
    The icfrequency file contains the frequency of the following concepts
    existing in some corpus. For example, our corpus consists of three
    concepts, A B & C, each occurring five times:
     SAB :: (include|exclude) <sources>
     REL :: (include|exclude) <relations>
     N :: 15
     A<>5
     B<>5
     C<>5
    In this case, our sources and relations consist of the following
    'graph': Notation....A->D means A is a child of D....
     A->D
     B->D
     B->E
     D->F
     C->E
     E->F
    So A B and C are "leaf" nodes and F is the root.
    Step 1: determine the descendants of each nodes
     Descendants(A) = {}
     Descendants(B) = {}
     Descendants(C) = {}
     Descendants(D) = {A, B}
     Descendants(E) = {B, C}
     Descendants(F) = {A, B, C, D, E, F}
    Step 2: determine the probability of a concept, P(c), occurring by
    summing the probability of each of descendants plus its probability.
     P(A) = freq(A) / N = .33
     P(B) = freq(B) / N = .33
     P(C) = freq(C) / N = .33
     P(D) = (freq(A)+freq(B)+freq(D)) / N = .66
     P(E) = (freq(B)+freq(C)+freq(E)) / N = .66
     P(F) = (freq(A)+freq(B)+freq(C)+freq(D)+freq(E)+freq(F)) / N = .99
    Step 3: print the probability of the concept occurring, P(c), for each
    node in the sources/relations defined in the configuration table.
     SMOOTH :: 0 <- or 1 if smoothing was used
     SAB :: (include|exclude) <sources>
     REL :: (include|exclude) <relations>
     RELA :: (include|exclude) <relas>  <- if any are specified in the config
     N :: 15
     A<>.33
     B<>.33
     C<>.33
     D<>.66
     E<>.66
     F<>.99
    The information content for the nodes is then calculated by taking -log
    of this probability.
    We have an option that incorporates Laplace smoothing. Laplace smoothing
    is where the frequency count of each of the concepts in the taxonomy is
    incremented by one. The advantage of doing this is that it avoids having
    a concept that has a probability of zero. The disadvantage is that it
    can shift the overall probability mass of the concepts from what is
    actually seen in the corpus.
  SOFTWARE COPYRIGHT AND LICENSE
        Copyright (C) 2004-2011 Bridget T McInnes, Siddharth Patwardhan, 
        Serguei Pakhomov and Ted Pedersen
        This suite of programs is free software; you can redistribute it and/or
        modify it under the terms of the GNU General Public License as published
        by the Free Software Foundation; either version 2 of the License, or (at
        your option) any later version.
        This program is distributed in the hope that it will be useful, but
        WITHOUT ANY WARRANTY; without even the implied warranty of
        MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
        General Public License for more details.
        You should have received a copy of the GNU General Public License along
        with this program; if not, write to the Free Software Foundation, Inc.,
        59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
        Note: The text of the GNU General Public License is provided in the file
        'GPL.txt' that you should have received with this distribution.
  REFERENCING
        If you write a paper that has used UMLS-Similarity in some way, we'd 
        certainly be grateful if you sent us a copy and referenced UMLS-Interface. 
        We have a published paper that provides a suitable reference:
        @inproceedings{McInnesPP09,
           title={{UMLS-Interface and UMLS-Similarity : Open Source 
                   Software for Measuring Paths and Semantic Similarity}}, 
           author={McInnes, B.T. and Pedersen, T. and Pakhomov, S.V.}, 
           booktitle={Proceedings of the American Medical Informatics 
                      Association (AMIA) Symposium},
           year={2009}, 
           month={November}, 
           address={San Fransisco, CA}
        }
        This paper is also found in
        <http://www-users.cs.umn.edu/~bthomson/publications/pubs.html>
        or
        <http://www.d.umn.edu/~tpederse/Pubs/amia09.pdf>
  REFERENCES
        1   Wu Z. and Palmer M. 1994. Verb Semantics and Lexical Selection. In
            Proceedings of the 32nd Annual Meeting of the Association for
            Computational Linguistics.  Las Cruces, New Mexico.
        2   Resnik P. 1995. Using information content to evaluate semantic
            similarity. In Proceedings of the 14th International Joint
            Conference on Artificial Intelligence, pages 448-453, Montreal.
        3   Jiang J. and Conrath D. 1997. Semantic similarity based on corpus
            statistics and lexical taxonomy. In Proceedings of International
            Conference on Research in Computational Linguistics, Taiwan.
        4   Fellbaum C., editor. WordNet: An electronic lexical database. MIT
            Press, 1998.
        5   Leacock C. and Chodorow M. 1998. Combining local context and WordNet
            similarity for word sense identification. In Fellbaum 1998, pp.
            265-283.
        6   Lin D. 1998. An information-theoretic definition of similarity. In
            Proceedings of the 15th International Conference on Machine
            Learning, Madison, WI.
        7   Hirst G. and St-Onge D. 1998. Lexical Chains as representations of
            context for the detection and correction of malapropisms. In
            Fellbaum 1998, pp. 305-332.
        8   Schütze H. 1998. Automatic Word Sense Discrimination. Computational
            Linguistics, 24(1):97-123.
        9   Resnik P. 1999. Semantic Similarity in a Taxonomy: An Information-
            Based Measure and its Applications to Problems of Ambiguity in
            Natural Language. Journal of Artificial Intelligence Research, 11,
            95-130.
        10  Budanitsky A. and Hirst G. 2001. Semantic distance in WordNet: An
            experimental, application-oriented evaluation of five measures. In
            Workshop on WordNet and Other Lexical Resources, Second meeting of
            the North American Chapter of the Association for Computational
            Linguistics. Pittsburgh, PA.
        11  Banerjee S. and Pedersen T. 2002. An Adapted Lesk Algorithm for Word
            Sense Disambiguation Using WordNet. In Proceeding of the Fourth
            International Conference on Computational Linguistics and
            Intelligent Text Processing (CICLING-02). Mexico City.
        12  Patwardhan S., Banerjee S. and Pedersen T. 2002. Using Semantic
            Relatedness for Word Sense Disambiguation. In Proceedings of the
            Fourth International Conference on Intelligent Text Processing and
            Computational Linguistics, Mexico City.
        13  Banerjee S. Adapting the Lesk algorithm for word sense
            disambiguation to WordNet. Master Thesis, University of Minnesota,
            Duluth, 2002.
        14  Patwardhan S. Incorporating dictionary and corpus information into a
            vector measure of semantic relatedness. Master Thesis, University of
            Minnesota, Duluth, 2003.
        15  Patwardhan, S. and Pedersen T. Using WordNet Based Context Vectors 
            to Estimate the Semantic Relatedness of Concepts. In Proceedings of 
            the EACL 2006 Workshop Making Sense of Sense - Bringing Computational 
            Linguistics and Psycholinguistics Together, pp. 1-8, April 4, 2006, 
            Trento, Italy.
        16  Rada, R., Mili, H., Bicknell, E. and Blettner, M. Development and 
            application of a metric on semantic nets. In Proceedings of the 
            IEEE Transactions on Systems, Man, and Cybernetics, volume 19, 
            pages 17-30, 1989.
        17  Nguyen, H.A. and Al-Mubaid, H. New ontology based semantic 
            similarity mesaure for the biomedical domain. In Proceedings of 
            the IEEE International Conference on Granular Computing, pages 
            623-628, 2006.
  SEE ALSO
    <http://search.cpan.org/dist/UMLS-Interface>
    <http://search.cpan.org/dist/UMLS-Similarity>
  CONTACT US
    If you have any trouble installing and using UMLS-Interface, please
    contact us via the users mailing list :
    umls-similarity@yahoogroups.com
    You can join this group by going to:
    <http://tech.groups.yahoo.com/group/umls-similarity/>
    You may also contact us directly if you prefer :
      Bridget T. McInnes: bthomson at cs.umn.edu
      Ted Pedersen      : tpederse at d.umn.edu
  AUTHORS
     Bridget T McInnes, University of Minnesota Twin Cities
     bthomson at cs.umn.edu
     Siddharth Patwardhan, University of Utah
     sidd at cs.utah.edu
     Serguei Pakhomov, University of Minnesota Twin Cities
     pakh002 at umn.edu
     Ted Pedersen, University of Minnesota Duluth
     tpederse at d.umn.edu
     Ying Liu, University of Minnesota
     liux0395 at umn.edu
  DOCUMENTATION COPYRIGHT AND LICENSE
    Copyright (C) 2003-2013 Bridget T. McInnes, Siddharth Patwardhan,
    Serguei Pakhomov, Ying Liu and Ted Pedersen.
    Permission is granted to copy, distribute and/or modify this document
    under the terms of the GNU Free Documentation License, Version 1.2 or
    any later version published by the Free Software Foundation; with no
    Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
    Note: a copy of the GNU Free Documentation License is available on the
    web at:
    <http://www.gnu.org/copyleft/fdl.html>
    and is included in this distribution as FDL.txt.
	Global
`s`	Focus search bar
`?`	Bring up this help dialog
	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)
	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse
	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)