          IN ORDER TO BECOME FAMILIAR WITH THE DecisionTree MODULE
          ========================================================
            NOTE:  This README does not talk about the examples
                   that illustrate bagging, boosting, and how to
                   use randomized trees.  Regarding those examples,
                   see the README files in the directories
                   ExamplesBagging, ExamplesBoosting, and
                   ExamplesRandomizedTrees.
(1) First run the scripts
        construct_dt_and_classify_one_sample_case1.pl
        construct_dt_and_classify_one_sample_case2.pl
        construct_dt_and_classify_one_sample_case3.pl
        construct_dt_and_classify_one_sample_case4.pl
    as they are.  The first script is for the purely symbolic case, the
    second for a case that involves both numeric and symbolic features, the
    third for the case of purely numeric features, and the last for the
    case when the training data is synthetically generated by the script
    generate_training_data_numeric.pl.  Next, try modifying the test
    sample in these scripts and see what classification results you get
    for the new test samples.
(2) The first script mentioned above uses the training file
    `training_symbolic.csv', the second and the third scripts listed above
    use the training file `stage3cancer.csv', and the last script named
    above uses the training data file `training.csv'.  Regarding
    these training files:
       training_symbolic.csv:    See the script
                                    `generate_training_data_symbolic.pl'
                                 regarding how this purely symbolic data
                                 is generated.
       stage3cancer.csv:   Example of a CSV training data file with both 
                           symbolic and numeric features. 
       training.csv    :   Example of a CSV training data file for the
                           purely numeric case.  Contains two classes, each
                           a Gaussian distribution in 2D.  The parameters of
                           the two Gaussians are in the file: 
                           `param_numeric.txt'
    There are two additional training data files in the directory:
          training2.csv
          training3.csv
    These are similar to the file `training.csv' in the sense that 
    they both contain two classes, each a 2D Gaussian distribution.
    The first, `training2.csv', was generated by the script
    `generate_training_data_numeric.pl' using the parameter file
           param_numeric_strongly_overlapping_classes.txt
    and the second, `training3.csv', was generated by the same script
    using the parameter file
           param_numeric_extremely_overlapping_classes.txt
(3) So far we have talked about classifying one test data record at a time.
    You can place multiple test data records in a disk file and classify
    them all in one go.  To see how that can be done, execute the following
    command line in the `Examples' directory:
         classify_test_data_in_a_file.pl   training4.csv   test4.csv   out4.csv
    This script constructs the decision tree from the data in the first
    argument file and then uses it to classify the data in the second
    argument file.  The computed class labels are deposited in the third
    argument file.
    In general, the test data files should look identical to the training
    data files.  Of course, for real-world test data, you will not have the
    class labels for the test samples.  You are still required to reserve a
    column for the class label, which now must be just the empty string ""
    for each data record.  For example, the test data supplied through the
    file test4_no_class_labels.csv in the following call does not include
    class labels:
         classify_test_data_in_a_file.pl training4.csv test4_no_class_labels.csv out4.csv
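
    As a minimal sketch of preparing such a file, the snippet below blanks
    out the class-label column of a few CSV records so that the column is
    still present but holds only the empty string "".  The column index,
    the sample records, and the sub name blank_class_labels are all
    hypothetical; your CSV layout may differ, and the plain split on commas
    assumes that no field contains an embedded comma.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the class-label field of each CSV record
# with the empty string "" while keeping the column in place.
# Pass data records only (not the header row).  Assumes simple CSV with
# no commas embedded inside quoted fields.
sub blank_class_labels {
    my ($class_col, @records) = @_;
    my @out;
    for my $rec (@records) {
        my @fields = split /,/, $rec, -1;   # -1 keeps trailing empty fields
        $fields[$class_col] = '""';
        push @out, join(',', @fields);
    }
    return @out;
}

# Made-up records: sample name, class label, two numeric features.
my @labeled = (
    '"sample_1","class_a",3.1,7.4',
    '"sample_2","class_b",9.8,1.2',
);
my @unlabeled = blank_class_labels(1, @labeled);
print "$_\n" for @unlabeled;
```

    For files with quoted commas or other CSV subtleties, a proper CSV
    parser would be a safer choice than split.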
(4) Let's now talk about how you can deal with features that, statistically
    speaking, are not so "nice".  We are talking about features with
    heavy-tailed distributions over large value ranges.  As mentioned in
    the HTML-based API for this module, such features can create problems
    with the estimation of the probability distributions associated with
    them.  The main problem that such features cause is deciding how best
    to sample the value range.
    Beginning with Version 2.22, you have two options in dealing with such
    features.  You can choose to go with the default behavior of the
    module, which is to sample the value range for such a feature over a
    maximum of 500 points.  Or, you can supply an additional option to the
    constructor that sets a user-defined value for the number of points to
    use.  The name of the option is "number_of_histogram_bins".  The
    following script
          construct_dt_for_heavytailed.pl
    shows an example of a DecisionTree constructor with the
    "number_of_histogram_bins" option.
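
    To make the sampling issue concrete, here is a rough sketch (not the
    module's own code) of counting a feature's values in a fixed number of
    equal-width bins.  With a heavy-tailed feature, a few extreme values
    stretch the range so much that nearly all samples crowd into the first
    bin.  The sub name histogram and the sample values are made up for
    illustration.

```perl
use strict;
use warnings;

# Illustrative only: count values in $nbins equal-width bins spanning
# the observed range [min, max].
sub histogram {
    my ($nbins, @values) = @_;
    my ($min, $max) = ($values[0], $values[0]);
    for my $v (@values) {
        $min = $v if $v < $min;
        $max = $v if $v > $max;
    }
    my $width  = ($max - $min) / $nbins;
    my @counts = (0) x $nbins;
    for my $v (@values) {
        my $i = $width > 0 ? int(($v - $min) / $width) : 0;
        $i = $nbins - 1 if $i >= $nbins;    # the maximum goes in the last bin
        $counts[$i]++;
    }
    return @counts;
}

# Made-up heavy-tailed feature: most values are small, one is huge.
my @feature = (1, 2, 2, 3, 3, 4, 5, 1000);
my @counts  = histogram(5, @feature);
print "bin counts: @counts\n";
```

    Raising the number of bins gives the crowded low end of the range a
    finer resolution, which is the kind of trade-off the
    "number_of_histogram_bins" option lets you control.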
===========================================================================
          FOR USING A DECISION TREE CLASSIFIER INTERACTIVELY
Starting with Version 1.6 of the module, you can use the DecisionTree
classifier in an interactive mode.  In this mode, after the decision tree
has been constructed, the user is prompted for answers to the questions
regarding the feature tests at the nodes of the tree.  Depending on the
answer supplied by the user at a node, the classifier takes the
corresponding path down the tree to the next node, and so on.  To get a
feel for using a decision tree in this mode, examine the script
        classify_by_asking_questions.pl
Execute the script as it is and see what happens.
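
The question-and-answer descent described above can be pictured with a toy
example.  The sketch below is not the module's implementation: the tree is
built by hand, the feature names and class labels are invented, and canned
answers stand in for what a user would type at the prompt.

```perl
use strict;
use warnings;

# Toy hand-built tree (hypothetical features and classes).  Internal nodes
# name a feature and map each possible answer to a child; leaves carry a
# class label.
my $tree = {
    feature  => 'color',
    children => {
        green => { label => 'class_1' },
        red   => {
            feature  => 'size',
            children => {
                small => { label => 'class_2' },
                large => { label => 'class_3' },
            },
        },
    },
};

# Descend from the root, asking one question per node via the $ask callback.
sub classify_interactively {
    my ($node, $ask) = @_;
    while (!exists $node->{label}) {
        my $answer = $ask->($node->{feature}, sort keys %{ $node->{children} });
        my $child  = $node->{children}{$answer};
        die "no branch for answer '$answer'\n" unless $child;
        $node = $child;
    }
    return $node->{label};
}

# Canned answers stand in for a user typing at the prompt.
my %answers = (color => 'red', size => 'small');
my $label = classify_interactively($tree, sub {
    my ($feature, @choices) = @_;
    print "Value for $feature (@choices)? $answers{$feature}\n";
    return $answers{$feature};
});
print "classification: $label\n";
```

To make the sketch truly interactive, the callback could read a line from
STDIN instead of looking up a canned answer.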
===========================================================================
     EVALUATING THE CLASS DISCRIMINATORY POWER OF YOUR TRAINING DATA
Given a training data file that contains data records and the associated
class labels, one often wants to know the quality of the data in the file.
In other words, one wants to know if a training data file contains
sufficient information to discriminate between the different classes
mentioned in the file.
Starting with Version 2.2 of the DecisionTree module, you can run a
10-fold cross-validation test on your training data to find out how much
class-discriminatory information is contained in the data.  The following
two scripts in the Examples directory show how:
       evaluate_training_data1.pl
       evaluate_training_data2.pl
As these scripts show, the class
       EvalTrainingData
defined in the main DecisionTree module file makes it straightforward to
evaluate the class-discriminatory power of your data (as long as it resides
in a `.csv' file).  This new class is a subclass of the DecisionTree class
in the module file.
Both the `evaluate' scripts mentioned above are identical in terms of the
usage logic shown.  The first is specifically for the training data file
`stage3cancer.csv' and the second for the training data files `training.csv',
`training2.csv', and `training3.csv'.  The latter three data files contain
two Gaussian classes that are increasingly overlapping.  You can see for
yourself the decreasing quality of the training data as you evaluate first
the training file `training.csv', then the training file `training2.csv',
and finally the training file `training3.csv'.
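
For readers unfamiliar with the procedure, the sketch below shows the
general idea behind 10-fold cross-validation: the records are split into 10
folds, and each fold serves once as the held-out test set while the
remaining folds are used for training.  This is a generic illustration and
not the EvalTrainingData implementation; the sub name make_folds and the
record names are hypothetical.

```perl
use strict;
use warnings;

# Generic sketch: deal record IDs round-robin into $k folds.
sub make_folds {
    my ($k, @ids) = @_;
    my @folds = map { [] } 1 .. $k;
    push @{ $folds[ $_ % $k ] }, $ids[$_] for 0 .. $#ids;
    return @folds;
}

my @ids   = map { "sample_$_" } 1 .. 25;   # made-up record names
my @folds = make_folds(10, @ids);

for my $i (0 .. $#folds) {
    my @test  = @{ $folds[$i] };
    my @train = map { @{ $folds[$_] } } grep { $_ != $i } 0 .. $#folds;
    # A real evaluation would build a tree from @train and classify @test here.
    printf "fold %2d: %d test, %d training records\n",
        $i, scalar @test, scalar @train;
}
```

In practice the records are usually shuffled before being dealt into folds,
so that any ordering in the data file does not bias the result.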
===========================================================================
                  USING THE DT INTROSPECTION CLASS
Starting with Version 2.3, you can ask the DTIntrospection class of the
module to explain the classification decisions made at the different nodes
of the decision tree.  An instance of this class can also show which
nodes of the tree are directly affected by a given training sample. A node
is affected directly by a training sample if the sample falls directly in
the portion of the feature space corresponding to that node. Yet another
thing that an instance of this class can show is how the influence of a
given training sample propagates in the decision tree.
Perhaps the most important bit of information you are likely to seek
through DT introspection is the list of the training samples that fall
directly in the portion of the feature space that is assigned to a node.
However, note that, when training samples are non-uniformly distributed in
the underlying feature space, it is possible for a node to exist even when
there are no training samples in the portion of the feature space assigned
to the node.  That is because the decision tree is constructed from the
probability densities estimated from the training data.  When the training
samples are non-uniformly distributed, it is entirely possible for the
estimated probability densities to be non-zero in a small region around a
point even when there are no training samples specifically in that region.
(After you have created a statistical model for, say, the height
distribution of people in a community, the model may return a non-zero
probability for the height values in a small interval even if the community
does not include a single individual whose height falls in that interval.)
That a decision-tree node can exist even where there are no training
samples in the portion of the feature space assigned to that node is an
indication of the generalization abilities of decision-tree-based
classification.
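
The height analogy can be checked numerically.  The sketch below fits a
single Gaussian to a handful of made-up heights and integrates the fitted
density over an interval that contains none of the samples; the result is
clearly non-zero.  The data and the sub names are hypothetical, and the
single Gaussian merely stands in for whatever density estimate is actually
used.

```perl
use strict;
use warnings;

my @heights = (150, 152, 155, 170, 172, 175);   # made-up data, in cm

# Fit a Gaussian: sample mean and (population) variance.
my $n    = @heights;
my $mean = 0;
$mean += $_ / $n for @heights;
my $var  = 0;
$var += ($_ - $mean) ** 2 / $n for @heights;

# Density of the fitted Gaussian at $x.
sub gaussian_pdf {
    my ($x) = @_;
    my $pi = 4 * atan2(1, 1);
    return exp(-($x - $mean) ** 2 / (2 * $var)) / sqrt(2 * $pi * $var);
}

# Trapezoidal integration of the density over [$lo, $hi].
sub interval_probability {
    my ($lo, $hi, $steps) = @_;
    my $h   = ($hi - $lo) / $steps;
    my $sum = (gaussian_pdf($lo) + gaussian_pdf($hi)) / 2;
    $sum += gaussian_pdf($lo + $_ * $h) for 1 .. $steps - 1;
    return $sum * $h;
}

# No sample lies between 160 and 165 cm, yet the model assigns mass there.
my $p = interval_probability(160, 165, 1000);
printf "P(160 <= height <= 165) = %.4f\n", $p;
```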
See the following three scripts in the Examples directory for how to carry
out DT introspection:
    introspection_in_a_loop_interactive.pl
    introspection_show_training_samples_at_all_nodes_direct_influence.pl
    introspection_show_training_samples_to_nodes_influence_propagation.pl
The first script places you in an interactive session in which you will be
asked for the node number you are interested in.  Subsequently, you will be
asked whether or not you are interested in specific questions that
introspection can provide answers for. The second script descends the
decision tree and shows for each node the training samples that fall
directly in the portion of the feature space assigned to that node.  The
third script shows for each training sample how it affects the
decision-tree nodes either directly or indirectly through the
generalization achieved by the probabilistic modeling of the data.
===========================================================================
              GENERATING SYNTHETIC TRAINING AND TEST DATA
Starting with Version 1.6, you can use the module itself to generate
synthetic training and test data.  See the scripts
        generate_training_and_test_data_numeric.pl
        generate_training_and_test_data_symbolic.pl
for how to generate training data for the decision-tree classifier for the
purely numeric case and for the purely symbolic case.  The data is
generated according to the information placed in a parameter file in each
case.  These files must follow certain rules regarding the declaration of
the classes, the features, the possible values for the features, etc.  An
example of such a parameter file for the numeric case is:
        param_numeric.txt
and for the symbolic case:
        param_symbolic.txt
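
As a rough idea of what such numeric data looks like, the sketch below
draws samples for two classes from 2D Gaussians with independent
coordinates and prints them as CSV.  It does not read a parameter file and
is not the module's generator; the class names, means, standard deviations,
and column layout are all made up.

```perl
use strict;
use warnings;

srand(42);   # fixed seed so that repeated runs start from the same state

# Box-Muller transform: returns one normally distributed value.
sub gaussian {
    my ($mean, $sd) = @_;
    my $u1 = 1 - rand();            # in (0, 1], avoids log(0)
    my $u2 = rand();
    my $pi = 4 * atan2(1, 1);
    return $mean + $sd * sqrt(-2 * log($u1)) * cos(2 * $pi * $u2);
}

# Hypothetical class parameters: [mean_x, mean_y] and a shared std. dev.
my %classes = (
    class_a => { mean => [2, 2], sd => 1 },
    class_b => { mean => [6, 6], sd => 1 },
);

my @rows = ('"sample","class","x","y"');   # made-up header row
my $id   = 0;
for my $class (sort keys %classes) {
    my $p = $classes{$class};
    for (1 .. 5) {
        my ($x, $y) = map { gaussian($_, $p->{sd}) } @{ $p->{mean} };
        push @rows, sprintf('%d,"%s",%.3f,%.3f', $id++, $class, $x, $y);
    }
}
print "$_\n" for @rows;
```

Moving the two class means closer together, or increasing the standard
deviation, produces more strongly overlapping classes.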