A Fast Clustering-Based
Feature Subset Selection Algorithm for High-Dimensional Data
Abstract
Feature selection
involves identifying a subset of the most useful features that produces results comparable to those of the original, entire set of features. A feature selection
algorithm may be evaluated from both the efficiency and effectiveness points of
view. While the efficiency concerns the time required to find a subset of
features, the effectiveness is related to the quality of the subset of
features. Based on these criteria, a fast clustering-based feature selection
algorithm (FAST) is proposed and experimentally evaluated in this paper. The
FAST algorithm works in two steps. In the first step, features are divided into
clusters by using graph-theoretic clustering methods. In the second step, the
most representative feature that is strongly related to target classes is
selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To
ensure the efficiency of FAST, we adopt the efficient minimum-spanning tree
(MST) clustering method. The efficiency and effectiveness of the FAST algorithm
are evaluated through an empirical study. Extensive experiments are carried out
to compare FAST and several representative feature selection algorithms, namely,
FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of
well-known classifiers, namely, the probability-based Naive Bayes, the
tree-based C4.5, the instance-based IB1, and the rule-based RIPPER before and
after feature selection. The results, on 35 publicly available real-world
high-dimensional image, microarray, and text data, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.
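As an illustration only, the two-step workflow described above can be sketched in Python; the callable parameters below (mst_clusters, class_relevance) are hypothetical placeholders for this sketch and are not part of the paper's API.

# Illustrative outline of the two-step FAST workflow; the callable
# parameters are hypothetical placeholders, not the paper's API.
def fast_feature_selection(features, target, mst_clusters, class_relevance):
    # Step 1: group features into clusters with a graph-theoretic
    # (minimum-spanning-tree based) clustering method.
    clusters = mst_clusters(features)
    # Step 2: from each cluster, keep the single most representative
    # feature, i.e., the one most strongly related to the target classes.
    return [max(cluster, key=lambda f: class_relevance(f, target))
            for cluster in clusters]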
Existing System
The embedded methods
incorporate feature selection as a part of the training process and are usually
specific to given learning algorithms, and therefore may be more efficient than
the other three categories. Traditional machine learning algorithms like
decision trees or artificial neural networks are examples of embedded
approaches. The wrapper methods use the predictive accuracy of a predetermined learning algorithm to determine the goodness of the selected subsets, so the accuracy of the learning algorithms is usually high. However, the generality of
the selected features is limited and the computational complexity is large. The
filter methods are independent of learning algorithms, with good generality.
Their computational complexity is low, but the accuracy of the learning algorithms
is not guaranteed. The hybrid methods are a combination of filter and wrapper methods: a filter method first reduces the search space that will be considered by the subsequent wrapper. They mainly focus on combining filter and wrapper methods to achieve the best possible performance with a particular learning algorithm at a time complexity similar to that of the filter methods.
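To make the wrapper idea above concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available, of scoring a candidate feature subset by the cross-validated accuracy of a predetermined learner; the decision tree and the synthetic data are illustrative choices, not part of the original text.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def wrapper_score(X, y, subset):
    # Goodness of a candidate subset = cross-validated accuracy of a
    # predetermined learner (here a decision tree) trained on that subset.
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X[:, subset], y, cv=5, scoring="accuracy").mean()

# Toy comparison of two candidate subsets on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)      # only features 0 and 3 carry signal
print(wrapper_score(X, y, [0, 3]), wrapper_score(X, y, [5, 7]))

Repeating such retraining over many candidate subsets is exactly why the text notes that wrapper methods have large computational complexity.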
Disadvantages
· The generality of the selected features is limited and the computational complexity is large.
· Their computational complexity is low, but the accuracy of the learning algorithms is not guaranteed.
· The hybrid methods are a combination of filter and wrapper methods, using a filter method to reduce the search space considered by the subsequent wrapper.
Proposed System
Feature subset selection
can be viewed as the process of identifying and removing as many irrelevant and
redundant features as possible. This is because irrelevant features do not contribute to predictive accuracy, and redundant features do not help in building a better predictor because they provide mostly information that is already present in other features. Of the many feature subset selection algorithms, some can effectively eliminate irrelevant features but fail to handle redundant ones, while others can eliminate the irrelevant features while also taking care of the redundant features. Our proposed FAST algorithm falls into the
second group. Traditionally, feature subset selection research has focused on
searching for relevant features. A well-known example is Relief, which weighs each feature according to its ability to discriminate instances under different targets based on a distance-based criterion function. However, Relief is
ineffective at removing redundant features as two predictive but highly
correlated features are likely both to be highly weighted. ReliefF extends Relief, enabling it to work with noisy and incomplete data sets and to deal with multiclass problems, but it still cannot identify redundant features.
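As a rough illustration of the Relief weighting idea described above, the following simplified sketch (a toy version for illustration, not the authors' implementation) rewards features that differ on the nearest instance of a different class and penalizes those that differ on the nearest instance of the same class.

import numpy as np

def relief_weights(X, y, n_samples=100, rng=None):
    # Simplified Relief: for a sampled instance, find its nearest hit
    # (same class) and nearest miss (different class), then update
    # each feature weight by +diff(miss) - diff(hit).
    rng = rng or np.random.default_rng(0)
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(n_samples):
        i = rng.integers(n)
        dists = np.abs(X - X[i]).sum(axis=1)
        dists[i] = np.inf                       # exclude the instance itself
        same, diff = (y == y[i]), (y != y[i])
        hit = np.argmin(np.where(same, dists, np.inf))
        miss = np.argmin(np.where(diff, dists, np.inf))
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_samples

Because two highly correlated predictive features receive nearly identical updates, both end up highly weighted, which is why Relief and ReliefF cannot remove redundancy.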
Advantages
· Good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.
· The proposed algorithm efficiently and effectively deals with both irrelevant and redundant features and obtains a good feature subset.
· Generally, all six algorithms achieve significant reduction of dimensionality by selecting only a small portion of the original features.
· The null hypothesis of the Friedman test is that all the feature selection algorithms are equivalent in terms of runtime.
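The Friedman test mentioned in the last point can be carried out directly on per-data-set measurements; here is a minimal sketch, assuming SciPy is available and using made-up runtime numbers purely for illustration, not results from the paper.

from scipy.stats import friedmanchisquare

# Hypothetical per-data-set runtimes (seconds) for three selectors;
# the numbers are illustrative only.
fast    = [1.2, 0.8, 2.1, 1.5, 0.9]
fcbf    = [1.9, 1.1, 2.8, 2.2, 1.4]
relieff = [5.3, 4.2, 7.9, 6.1, 4.8]

stat, p = friedmanchisquare(fast, fcbf, relieff)
# A small p-value rejects the null hypothesis that all algorithms
# are equivalent in terms of runtime.
print(stat, p)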
Modules
· Distributed clustering
· Subset Selection Algorithm
· Time complexity
· Microarray data
· Data Resource
· Irrelevant feature

Module Description
1. Distributed clustering
Distributional clustering has been used to cluster words into groups based either on their participation in particular grammatical relations with other words, by Pereira et al., or on the distribution of class labels associated with each word, by Baker and McCallum. Because distributional clustering of words is agglomerative in nature, and results in suboptimal word clusters and high computational cost, later work proposed a new information-theoretic divisive algorithm for word clustering and applied it to text classification. Another approach clusters features using a special metric of distance, and then makes use of the resulting cluster hierarchy to choose the most relevant attributes.
Unfortunately, the cluster evaluation measure based on distance does not
identify a feature subset that allows the classifiers to improve their original
performance accuracy. Furthermore, even compared with other feature selection
methods, the obtained accuracy is lower.
2. Subset Selection Algorithm
Irrelevant features, along with redundant features, severely affect the
accuracy of the learning machines. Thus, feature subset selection should be
able to identify and remove as much of the irrelevant and redundant information
as possible. Moreover, “good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.” Keeping these in mind, we develop a novel algorithm which can
efficiently and effectively deal with both irrelevant and redundant features,
and obtain a good feature subset.
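The correlation measure behind these relevance and redundancy judgments is symmetric uncertainty (SU). Below is a minimal sketch of SU for discrete-valued variables, an illustrative implementation rather than the authors' code.

import numpy as np
from collections import Counter

def entropy(values):
    # Shannon entropy of a discrete sequence.
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def symmetric_uncertainty(x, y):
    # SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)), where
    # IG(X | Y) = H(X) + H(Y) - H(X, Y) is the information gain.
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))          # joint entropy H(X, Y)
    ig = hx + hy - hxy
    return 2.0 * ig / (hx + hy) if hx + hy > 0 else 0.0

SU equals 1 when two variables completely determine each other and 0 when they are independent, so a good subset has high SU between each feature and the class (T-Relevance) and low SU between pairs of selected features (F-Correlation).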
3. Time complexity
The major amount of work for Algorithm 1 involves the computation of SU values for T-Relevance and F-Correlation, which has linear complexity in terms of the number of instances in a given data set. The first part of the algorithm has a linear time complexity in terms of the number of features m. Assuming k features are selected as relevant ones in the first part, when k = 1, only one feature is selected.
4. Microarray data
The
proportion of selected features has been improved by each of the six algorithms
compared with that on the given data sets. This indicates that the six
algorithms work well with microarray data. FAST ranks 1 again with the
proportion of selected features of 0.71 percent. Of the six algorithms, only
CFS cannot choose features for two data sets whose dimensionalities are 19,994
and 49,152, respectively.
5. Data Resource
For the purposes of evaluating the performance and effectiveness of our proposed FAST algorithm, verifying whether or not the method is potentially useful in practice, and allowing other researchers to confirm our results, 35 publicly available data sets were used. The numbers of features of the 35 data sets vary from 37 to 49,152 with a mean of 7,874. The dimensionalities of 54.3 percent of the data sets exceed 5,000, and 28.6 percent of the data sets have more than 10,000 features. The 35 data sets cover a range of application domains such as text, image, and bio microarray data classification. Note that for the data sets with continuous-valued features, the well-known off-the-shelf MDL method was used to discretize the continuous values.
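The MDL method referred to above is an entropy-based discretization. As a simplified sketch of its core step only (the recursive partitioning and the MDL stopping criterion are omitted, so this is an approximation for illustration), choosing a single cut point that minimizes the weighted class entropy looks like:

import numpy as np
from collections import Counter

def class_entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_cut_point(values, labels):
    # Threshold on a continuous feature that minimizes the weighted
    # class entropy of the two resulting bins.
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    best_t, best_e = None, np.inf
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue                          # only cut between distinct values
        left, right = y[:i], y[i:]
        e = (len(left) * class_entropy(left) + len(right) * class_entropy(right)) / len(y)
        if e < best_e:
            best_t, best_e = (v[i] + v[i - 1]) / 2.0, e
    return best_t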
6. Irrelevant feature
Irrelevant feature removal is straightforward once the right relevance measure is defined or selected, while redundant feature elimination is a bit more sophisticated. In our proposed FAST algorithm, it involves (1) the construction of a minimum spanning tree from a weighted complete graph; (2) the partitioning of the MST into a forest, with each tree representing a cluster; and (3) the selection of representative features from the clusters.
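A minimal sketch of these three steps, assuming the NetworkX library and a caller-supplied SU-like correlation function su(a, b) (both assumptions for illustration, not the authors' implementation):

import networkx as nx

def mst_feature_clusters(features, target, su):
    # features: dict mapping feature name -> value sequence; target: class labels;
    # su: caller-supplied symmetric-uncertainty-like correlation function.
    names = [f for f in features if su(features[f], target) > 0]   # drop irrelevant features
                                                                   # (simple placeholder threshold)
    # 1. Weighted complete graph over the remaining features, then its MST.
    g = nx.Graph()
    g.add_nodes_from(names)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            g.add_edge(a, b, weight=1.0 - su(features[a], features[b]))
    mst = nx.minimum_spanning_tree(g)

    # 2. Partition the MST into a forest: remove an edge when both of its
    #    endpoint features are more relevant to the class than to each other.
    for a, b in list(mst.edges()):
        f_corr = su(features[a], features[b])
        if f_corr < su(features[a], target) and f_corr < su(features[b], target):
            mst.remove_edge(a, b)

    # 3. From each remaining tree (cluster), keep the feature most strongly
    #    related to the class as its representative.
    return [max(tree, key=lambda f: su(features[f], target))
            for tree in nx.connected_components(mst)]

Each connected component that survives the edge removal plays the role of one feature cluster, and its most class-relevant member is the representative kept in the final subset.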
Flow Chart
CONCLUSION
We have presented a
novel clustering-based feature subset selection algorithm for high-dimensional data. The algorithm involves removing irrelevant features, constructing a minimum spanning tree from the relevant ones, and partitioning the MST and selecting representative features. In the proposed algorithm, a cluster consists of features. Each
cluster is treated as a single feature and thus dimensionality is drastically
reduced. We have compared the performance of the proposed algorithm with those of the five well-known feature selection algorithms FCBF, ReliefF, CFS, Consist, and FOCUS-SF
on the publicly available image, microarray, and text data from the four
different aspects of the proportion of selected features, runtime,
classification accuracy of a given classifier, and the Win/Draw/Loss record.
Generally, the proposed algorithm obtained the best proportion of selected
features, the best runtime, and the best classification accuracy for Naive Bayes and RIPPER, and the second best classification accuracy for IB1. The Win/Draw/Loss
records confirmed the conclusions. We also found that FAST obtains the rank of
1 for microarray data, the rank of 2 for text data, and the rank of 3 for image
data in terms of classification accuracy of the four different types of
classifiers, and CFS is a good alternative. At the same time, FCBF is a good
alternative for image and text data. Moreover, Consist, and FOCUS-SF are
alternatives for text data. For the future work, we plan to explore different
types of correlation measures, and study some formal properties of feature
space.