10.4225/03/58a39ec0b09c0
Muthuthanthiri Basthiyange Thilak Laksiri Fernando
Data Types and Measurement Scales in Data Analysis
Monash University
2017
Data types
Measurement scales
Interval scale
Ordinal scale
Random forest
Rank transform
usForest
Quantilization
Supervised learning
Unsupervised learning
2017-02-15 00:20:14
Thesis
https://bridges.monash.edu/articles/thesis/Data_Types_and_Measurement_Scales_in_Data_Analysis/4652680
Irrespective of the formal definitions of scales and
scale types given in measurement theory, data mining often assumes a
scale type for a given attribute based on superficial properties. Unless
there are only a small number of distinct values, a quantitative
attribute is usually assumed to be given in the interval scale. Based on
that assumption, many data mining algorithms use the magnitudes of the
values in calculations. This thesis shows that doing so can have serious
adverse consequences.

Measurement theory provides analyses to determine the scale types of
attributes. However, in data mining those analyses are often overlooked, and analyses that implicitly assume, without justification, that attributes are interval scale are performed instead. This can lead to two problems. First, the initial assumptions made about scale types may not be correct. Second, calculations in data mining produce derived scales for which scale type identification can be difficult. Thus, there is potential both for assumptions made about the raw data and for interpretations of derived values to be incorrect, and incorrect assumptions may produce sub-optimal results.

In this thesis, we show that both supervised and unsupervised learning
algorithms that assume data are given in the interval scale frequently
produce sub-optimal results when that assumption is violated. We then argue that original quantitative data often do not belong to the interval scale.

Operations defined for the ordinal scale are also valid for the interval
and ratio scales. Therefore, we argue that assuming quantitative data
are given in the ordinal scale is often more effective than assuming
quantitative data are given in the interval scale. The main objectives of our research are to assess the effects of violations of the interval scale assumption on data mining algorithms, and to find effective and efficient measures that are invariant to such violations.
We discuss our experiments on measures that assume quantitative data are
given in the ordinal scale. We tried unsupervised random forest and rank-transformed data on two unsupervised learning tasks: DBScan clustering and content-based multimedia information retrieval (CBMIR).
When compared with cityblock, Euclidean, cosine and Chebychev distances,
unsupervised random forest often produced lower F-Measure values and
lower precision@50 values in DBScan clustering and CBMIR, respectively.
We identified the use of synthetic data in unsupervised random forest as
the reason for the lower F-Measure values and lower precision@50
values. In contrast, rank-transformed data used with cityblock distance often produced higher or competitive F-Measure values in DBScan clustering, and higher or competitive precision@50 values in CBMIR, when compared with cityblock, Euclidean, cosine and Chebychev distances with original data. The latter distances with original data assume quantitative data are given in the interval scale, whereas cityblock distance with rank-transformed data assumes quantitative data are given in the ordinal scale.
In addition to causing the lower F-Measure and precision@50 values in DBScan clustering and CBMIR, synthetic data bring further disadvantages to unsupervised random forest: they are also the reason for its high execution time, its high memory requirements and the high number of zero-similarity pairs in the calculated similarity matrices. The main disadvantage of using rank-transformed data is that estimating the ranks of previously unseen instances requires high execution time and memory.
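For reference, a common construction of unsupervised random forest, attributed to Breiman (the thesis's variant may differ in detail), trains a classifier to separate the real data from synthetic data drawn from the product of the attribute marginals, then reads similarity off shared leaves. The synthetic data double the training set, which is the source of the costs noted above:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def urf_proximity(X, n_trees=500, seed=0):
        rng = np.random.default_rng(seed)
        # Synthetic data: sample each attribute independently from its
        # empirical marginal, destroying inter-attribute structure.
        X_syn = np.column_stack([rng.choice(col, size=len(X)) for col in X.T])
        X_all = np.vstack([X, X_syn])                 # doubles the dataset
        y_all = np.r_[np.ones(len(X)), np.zeros(len(X))]
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        rf.fit(X_all, y_all)
        leaves = rf.apply(X)                          # (n, n_trees) leaf ids
        # Proximity: fraction of trees in which two instances share a leaf.
        return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)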
We then introduced a new unsupervised learning methodology, unsupervised stochastic forest (usForest), which assumes quantitative data are given in the ordinal scale. In DBScan clustering, usForest produced higher
F-Measure values than cityblock, Euclidean, cosine and Chebychev distances with original data, the same distances with rank-transformed data, and unsupervised random forest.
In contrast to unsupervised random forest, usForest does not use
synthetic data. Hence, for a given task, usForest requires less
execution time and less memory than unsupervised random forest. In
similarity matrices, usForest produces fewer zero-similarity instance pairs than unsupervised random forest. In similarity calculations, usForest produced better estimates of the expected values with fewer trees than unsupervised random forest. Compared with rank-transformed data, usForest requires less execution time and memory to calculate similarity values for previously unseen data.
We identified usForest's capability to contrast the nearest neighbours of a given instance against the rest of the instances as the reason for the higher F-Measure values in DBScan clustering, which relies on nearest neighbour discovery. Accordingly, we predicted that algorithms that rely on nearest neighbour identification can especially benefit from
usForest. We extended usForest for datasets that have both quantitative
and qualitative attributes. We showed that, in DBScan clustering on datasets with both qualitative and quantitative attributes, the extended version of usForest produced higher F-Measure values than cityblock, Euclidean, cosine and Chebychev distances and unsupervised random forest. We then tested usForest on K-nearest neighbour classification, another algorithm that relies on nearest neighbour identification. There, usForest produced lower classification error values than Euclidean distance with original data in more cases than the reverse, and the same held even when compared with Euclidean distance with rank-transformed data.
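Such forest- or rank-based similarities plug into both algorithms through precomputed distance matrices. A hedged sketch (the similarity stand-in and the eps value are illustrative only):

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    y = rng.integers(0, 2, size=100)

    # Stand-in for a forest-derived similarity matrix in [0, 1]; any such
    # matrix (e.g. tree proximities) is used the same way.
    sim = np.exp(-np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2))
    dist = 1.0 - sim                          # similarity -> dissimilarity

    # DBScan clustering over precomputed distances.
    labels = DBSCAN(eps=0.6, min_samples=5,
                    metric='precomputed').fit_predict(dist)

    # KNN classification over the same distances; at predict time the
    # matrix rows are query-to-training distances.
    knn = KNeighborsClassifier(n_neighbors=5, metric='precomputed').fit(dist, y)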
In our supervised learning experiments we tested popular classifiers: artificial feed-forward neural networks (ANN), the naive Bayes classifier (NBC), linear discriminant analysis (LDA), support vector machines
(SVM), K-nearest neighbour classification (KNN) and multinomial logistic
regression (MLR). We used log loss to measure the performance of ANN, NBC, LDA and MLR, which produce posterior class probabilities, and classification error to measure the performance of SVM and KNN, which usually predict class labels. We found that all the tested classification algorithms assume that quantitative data are given in the interval scale and that they are
affected by violations of that assumption. For all six classifiers, tied rank transformed data, which assume quantitative data are given in the ordinal scale, more often produced lower error values than the original data, which assume quantitative data are given in the interval scale.
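For reference, with one-hot labels y_ik and predicted posteriors p_ik over N instances and K classes, log loss is the standard

    \mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\,\ln \hat{p}_{ik}

and classification error is simply the fraction of misclassified instances.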
As tied rank transform requires high execution time and high memory to
estimate the ranks of previously unseen instances, we introduced Quantilization, which requires less execution time and less memory than
tied rank transform. In our ANN, NBC, LDA, SVM, KNN and MLR experiments, Quantilization produced lower error values than the original data in more cases than the original data produced lower error values than Quantilization.
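The abstract does not define Quantilization in detail; one plausible reading, consistent with the cost argument above, is to retain only a fixed grid of training quantiles so that an unseen value is placed by binary search over a small number of boundaries rather than ranked against all training values. A hypothetical sketch, not the thesis's definition:

    import numpy as np

    def fit_quantile_grid(x_train, n_q=100):
        # Keep only n_q + 1 quantile boundaries of the training attribute.
        return np.quantile(x_train, np.linspace(0.0, 1.0, n_q + 1))

    def quantilize(x, grid):
        # Place each value in its quantile bin: O(log n_q) per value via
        # binary search, versus O(n) to estimate a tied rank against all
        # training values.
        return np.searchsorted(grid, x, side='right') / len(grid)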
As discussed above, this thesis contains our experiments, results, analysis and conclusions. The contributions of this thesis are summarised as follows.

1. We showed that violations of the interval scale assumption are very often present in real-world datasets, and that both supervised and unsupervised learning algorithms are often affected by such violations.

2. In the tested algorithms, with respect to the previously mentioned
performance measures, assuming quantitative data are given in the ordinal scale and using the rank transform produced better results more often than assuming quantitative data are given in the interval scale.

3. We introduced a new unsupervised learning methodology, unsupervised
stochastic forest (usForest), which assumes quantitative data are given in the ordinal scale. We showed that it can produce higher F-Measure values in DBScan clustering and good precision@50 values in CBMIR. It requires less execution time and less memory than unsupervised random
forest. To produce a reliable estimate of the respective expected values in similarity calculations, usForest needs fewer trees than unsupervised random forest. usForest needs less execution time and memory than the rank transform when previously unseen instances are used. We extended usForest to qualitative data as well.

4. We introduced Quantilization, which assumes only that quantitative data
are given in the ordinal scale. Quantilization needs less execution
time and less memory than the rank transform. In the tested classification algorithms, Quantilization often produced lower error values than the original data.