Anomaly Detection via Online Over-Sampling Principal Component Analysis
ABSTRACT:
Anomaly detection has been an important
research topic in data mining and machine learning. Many real-world
applications such as intrusion or credit card fraud detection require an
effective and efficient framework to identify deviated data instances. However,
most anomaly detection methods are typically implemented in batch mode, and
thus cannot be easily extended to large-scale problems without incurring heavy
computation and memory costs. In this paper, we propose an online
over-sampling principal component analysis (osPCA) algorithm to address this
problem, and we aim at detecting the presence of outliers from a large amount
of data via an online updating technique. Unlike prior PCA-based approaches, we
do not store the entire data matrix or covariance matrix, and thus our approach
is especially of interest in online or large-scale problems. By over-sampling
the target instance and extracting the principal direction of the data, the
proposed osPCA allows us to determine the anomaly of the target instance
according to the variation of the resulting dominant eigenvector. Since our
osPCA need not perform eigen analysis explicitly, the proposed framework is
favored for online applications which have computation or memory limitations.
Compared with the well-known power method
for PCA and other popular anomaly detection algorithms, our experimental
results verify the feasibility of our proposed method in terms of both accuracy
and efficiency.
EXISTING SYSTEM:
The existing approaches can be divided into three categories:
1. distribution (statistical),
2. distance, and
3. density based methods.
Statistical approaches assume that the data follows
some standard or predetermined distributions, and this type of approach aims to
find the outliers which deviate from such distributions.
For distance-based methods, the distances between
each data point of interest and its neighbors are calculated. If the result is
above some predetermined threshold, the target instance will be considered as
an outlier.
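The distance-based idea above can be sketched as follows. This is only an illustration, not code from the referenced paper: the class name `KnnDistanceOutlier`, the choice of the mean distance to the k nearest neighbors as "the result", and the values of `k` and `threshold` are all assumptions.

```java
import java.util.Arrays;

// Illustrative sketch only: the class name, k, and the threshold are assumptions.
final class KnnDistanceOutlier {

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Mean distance from data[target] to its k nearest neighbors (itself excluded).
    // Requires 1 <= k <= data.length - 1.
    static double meanKnnDistance(double[][] data, int target, int k) {
        double[] dists = new double[data.length - 1];
        int idx = 0;
        for (int j = 0; j < data.length; j++) {
            if (j != target) {
                dists[idx++] = distance(data[target], data[j]);
            }
        }
        Arrays.sort(dists);
        double sum = 0.0;
        for (int i = 0; i < k; i++) {
            sum += dists[i];
        }
        return sum / k;
    }

    // Flags the target as an outlier when its neighbor distance exceeds the threshold.
    static boolean isOutlier(double[][] data, int target, int k, double threshold) {
        return meanKnnDistance(data, target, k) > threshold;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {1, 0}, {1, 1}, {10, 10}};
        for (int i = 0; i < data.length; i++) {
            System.out.println(i + ": outlier=" + isOutlier(data, i, 2, 3.0));
        }
    }
}
```

Every query scans the whole data set, which is one reason such methods are awkward to use on large-scale or streaming data.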
For density-based methods, one representative approach uses the local outlier
factor (LOF) to measure the outlierness of each data instance. Based on the
local density of each data instance, the LOF determines the degree of
outlierness, which provides suspicious ranking scores for all samples. The most
important property of the LOF is the ability to estimate local data structure
via density estimation. This allows users to identify outliers which are
sheltered under a global data structure.
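A compact sketch of the standard LOF computation is given below to make the density idea concrete. The class name `LofSketch` and the parameter `k` are assumptions for illustration; ties among neighbors and duplicate points (zero distances) are not treated specially, so this is a simplified version rather than a reference implementation.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative LOF sketch; class and method names are assumptions.
final class LofSketch {

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    // Returns one LOF score per row of data; requires 1 <= k < data.length.
    static double[] lof(double[][] data, int k) {
        int n = data.length;
        int[][] neighbors = new int[n][k];
        double[] kDistance = new double[n];
        for (int i = 0; i < n; i++) {
            final double[] d = new double[n];
            Integer[] order = new Integer[n];
            for (int j = 0; j < n; j++) {
                d[j] = (j == i) ? Double.POSITIVE_INFINITY : distance(data[i], data[j]);
                order[j] = j;
            }
            // Sort indices by distance to point i; the point itself goes last.
            Arrays.sort(order, new Comparator<Integer>() {
                public int compare(Integer x, Integer y) {
                    return Double.compare(d[x.intValue()], d[y.intValue()]);
                }
            });
            for (int m = 0; m < k; m++) {
                neighbors[i][m] = order[m].intValue();
            }
            kDistance[i] = d[neighbors[i][k - 1]];
        }
        // Local reachability density: inverse of the mean reachability distance.
        double[] lrd = new double[n];
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int o : neighbors[i]) {
                sum += Math.max(kDistance[o], distance(data[i], data[o]));
            }
            lrd[i] = k / sum;
        }
        // LOF: mean density of the neighbors divided by the point's own density.
        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int o : neighbors[i]) {
                sum += lrd[o];
            }
            scores[i] = sum / (k * lrd[i]);
        }
        return scores;
    }
}
```

Scores close to 1 indicate a point whose local density matches that of its neighbors; noticeably larger scores indicate a point in a sparser region than its neighbors.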
DISADVANTAGES OF EXISTING SYSTEM:
Most distribution models are assumed to be univariate, and thus the lack of
robustness for multidimensional data is a concern. Moreover, since these
methods are typically implemented in the original data space directly, their
solution models might suffer from the noise present in the data.
PROPOSED SYSTEM:
PCA is a well-known unsupervised dimension reduction method, which determines
the principal directions of the data distribution. Applied naively, however,
osPCA with the leave-one-out strategy would have to recompute these principal
directions for every target instance. This will prohibit the use of our
proposed framework for real-world large-scale applications. Although the
well-known power method is able to produce approximated PCA solutions, it
requires the storage of the covariance matrix and cannot be easily extended to
applications with streaming data or online settings. Therefore, we present an
online updating technique for our osPCA. This updating technique allows us to
efficiently calculate the approximated dominant eigenvector without performing
eigen analysis or storing the data covariance matrix.
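For context, the batch power method mentioned above can be sketched as follows. This is the baseline that the text contrasts against, not the online osPCA update itself; the class name `PowerMethodSketch`, the fixed iteration count, and the seeded random start vector are assumptions made for illustration.

```java
import java.util.Random;

// Illustrative sketch of the batch power method; names are assumptions.
final class PowerMethodSketch {

    // Covariance matrix (p x p) of row-wise data. Building and keeping this
    // matrix is the storage cost the text refers to.
    static double[][] covariance(double[][] data) {
        int n = data.length;
        int p = data[0].length;
        double[] mean = new double[p];
        for (double[] row : data) {
            for (int j = 0; j < p; j++) {
                mean[j] += row[j] / n;
            }
        }
        double[][] c = new double[p][p];
        for (double[] row : data) {
            for (int a = 0; a < p; a++) {
                for (int b = 0; b < p; b++) {
                    c[a][b] += (row[a] - mean[a]) * (row[b] - mean[b]) / n;
                }
            }
        }
        return c;
    }

    // Scales v to unit length in place; returns false if v is the zero vector.
    static boolean normalize(double[] v) {
        double norm = 0.0;
        for (double x : v) {
            norm += x * x;
        }
        norm = Math.sqrt(norm);
        if (norm == 0.0) {
            return false;
        }
        for (int i = 0; i < v.length; i++) {
            v[i] /= norm;
        }
        return true;
    }

    // Repeatedly multiplies a start vector by c and renormalizes it.
    static double[] dominantEigenvector(double[][] c, int iterations) {
        int p = c.length;
        Random rnd = new Random(42); // fixed seed so runs are repeatable
        double[] u = new double[p];
        for (int j = 0; j < p; j++) {
            u[j] = rnd.nextGaussian();
        }
        normalize(u);
        for (int it = 0; it < iterations; it++) {
            double[] v = new double[p];
            for (int a = 0; a < p; a++) {
                for (int b = 0; b < p; b++) {
                    v[a] += c[a][b] * u[b];
                }
            }
            if (!normalize(v)) {
                break; // c * u vanished; keep the last estimate
            }
            u = v;
        }
        return u;
    }
}
```

The returned vector is defined only up to sign, so comparisons between two directions should use the absolute value of their inner product.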
ADVANTAGES OF PROPOSED SYSTEM:
· Compared to the power method or other popular anomaly detection algorithms,
the computational cost and memory requirements are significantly reduced, and
thus our method is especially preferable in online, streaming data, or large
scale problems.
SYSTEM ARCHITECTURE:
ALGORITHMS USED:
Anomaly Detection via Online Oversampling PCA
MODULES
1. Cleaning Data
2. Detecting Outliers
3. Clustering
MODULES DESCRIPTION
MODULE - I
Cleaning Data
osPCA is applied to the data set to find the principal direction. In this
method the target instance is duplicated multiple times; the idea is to amplify
the effect of an outlier rather than that of normal data. Then, using the Leave
One Out (LOO) strategy, the angle difference between the resulting principal
directions is identified: adding or removing one data instance changes the
direction. For normal data instances this angle difference should be small,
and for outliers it should be larger.
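The over-sampling idea above can be sketched in batch form as follows. This is a minimal illustration under stated assumptions, not the online update of the referenced paper: the class name `OsPcaScoreSketch`, the over-sampling parameter `ratio`, the use of power iteration, and the score 1 - |cos θ| (with θ the angle between the two principal directions) are all choices made here. Duplicating the target is emulated by giving it a larger weight in the covariance, which is equivalent to adding copies of it.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative batch sketch of the over-sampling score; names are assumptions.
final class OsPcaScoreSketch {

    // Weighted covariance; a weight w[i] acts like having w[i] copies of row i.
    static double[][] weightedCovariance(double[][] x, double[] w) {
        int n = x.length;
        int p = x[0].length;
        double total = 0.0;
        double[] mean = new double[p];
        for (int i = 0; i < n; i++) {
            total += w[i];
            for (int j = 0; j < p; j++) {
                mean[j] += w[i] * x[i][j];
            }
        }
        for (int j = 0; j < p; j++) {
            mean[j] /= total;
        }
        double[][] c = new double[p][p];
        for (int i = 0; i < n; i++) {
            for (int a = 0; a < p; a++) {
                for (int b = 0; b < p; b++) {
                    c[a][b] += w[i] * (x[i][a] - mean[a]) * (x[i][b] - mean[b]) / total;
                }
            }
        }
        return c;
    }

    // Scales v to unit length in place; returns false if v is the zero vector.
    static boolean normalize(double[] v) {
        double norm = 0.0;
        for (double e : v) {
            norm += e * e;
        }
        norm = Math.sqrt(norm);
        if (norm == 0.0) {
            return false;
        }
        for (int i = 0; i < v.length; i++) {
            v[i] /= norm;
        }
        return true;
    }

    // Dominant direction of c by power iteration with a fixed iteration count.
    static double[] dominantDirection(double[][] c) {
        int p = c.length;
        Random rnd = new Random(7); // fixed seed so runs are repeatable
        double[] u = new double[p];
        for (int j = 0; j < p; j++) {
            u[j] = rnd.nextGaussian();
        }
        normalize(u);
        for (int it = 0; it < 500; it++) {
            double[] v = new double[p];
            for (int a = 0; a < p; a++) {
                for (int b = 0; b < p; b++) {
                    v[a] += c[a][b] * u[b];
                }
            }
            if (!normalize(v)) {
                break;
            }
            u = v;
        }
        return u;
    }

    // Score of x[target]: 1 - |cos| of the angle between the principal direction
    // of the data and the direction after adding ratio * n copies of the target.
    static double score(double[][] x, int target, double ratio) {
        double[] w = new double[x.length];
        Arrays.fill(w, 1.0);
        double[] u = dominantDirection(weightedCovariance(x, w));
        w[target] += ratio * x.length;
        double[] uTilde = dominantDirection(weightedCovariance(x, w));
        double dot = 0.0;
        for (int j = 0; j < u.length; j++) {
            dot += u[j] * uTilde[j];
        }
        return 1.0 - Math.abs(dot);
    }
}
```

Calling `score(data, i, 0.1)` for each index `i` gives a ranking in which instances that pull the principal direction most strongly receive the largest scores. Because this sketch recomputes the covariance and reruns the iteration for every target, it illustrates the cost that an online update is meant to avoid.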
A set of data instances from the original data set is taken as predefined
input. This data may be contaminated by noise, incorrect data labelling, etc.
It should be error free, because it is going to be used as training data. So
the cleaning is done using the Over-Sampling Principal Component Analysis
(osPCA) method. The score of outlierness St is then calculated for each data
instance. The smallest St value is taken as the threshold value.
MODULE - II
Detection
This module detects the outlierness of the user input. When the user gives an
input to the system, the system calculates the St value for the new input and
then compares it with the threshold value calculated earlier.
If the St value of the new data instance is above the threshold value, that
input is identified as an outlier and is discarded by the system. Otherwise it
is considered a normal data instance, and the PCA model is updated accordingly
with that data instance.
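The decision flow of this module might look like the sketch below. It is a hypothetical illustration: the class `OnlineDetectorSketch`, its `ratio` and `threshold` parameters, and the score 1 - |cos θ| are assumptions. Unlike the online update described in this document, the sketch keeps a p x p matrix of running sums and uses power iteration, which keeps the code short but does not reproduce the memory savings of osPCA.

```java
import java.util.Random;

// Hypothetical sketch of the detection step; not the paper's online update.
final class OnlineDetectorSketch {
    private final double ratio;        // assumed over-sampling ratio
    private final double threshold;    // threshold from the cleaning phase
    private double count = 0.0;        // number of accepted (normal) instances
    private final double[] sum;        // running sum of x
    private final double[][] sumOuter; // running sum of x * x^T

    OnlineDetectorSketch(int p, double ratio, double threshold) {
        this.ratio = ratio;
        this.threshold = threshold;
        this.sum = new double[p];
        this.sumOuter = new double[p][p];
    }

    int size() {
        return (int) count;
    }

    // Adds `weight` copies of x to the given running sums.
    private static void accumulate(double[] s, double[][] so, double[] x, double weight) {
        for (int a = 0; a < x.length; a++) {
            s[a] += weight * x[a];
            for (int b = 0; b < x.length; b++) {
                so[a][b] += weight * x[a] * x[b];
            }
        }
    }

    // Accepts x as normal data and folds it into the model.
    void addNormal(double[] x) {
        accumulate(sum, sumOuter, x, 1.0);
        count += 1.0;
    }

    // Unit-length dominant direction of the covariance implied by n, s, and so.
    private static double[] direction(double n, double[] s, double[][] so) {
        int p = s.length;
        double[] u = new double[p];
        Random rnd = new Random(7); // fixed seed so runs are repeatable
        double start = 0.0;
        for (int j = 0; j < p; j++) {
            u[j] = rnd.nextGaussian();
            start += u[j] * u[j];
        }
        for (int j = 0; j < p; j++) {
            u[j] /= Math.sqrt(start);
        }
        for (int it = 0; it < 500; it++) {
            double[] v = new double[p];
            double norm = 0.0;
            for (int a = 0; a < p; a++) {
                for (int b = 0; b < p; b++) {
                    v[a] += (so[a][b] / n - (s[a] / n) * (s[b] / n)) * u[b];
                }
                norm += v[a] * v[a];
            }
            norm = Math.sqrt(norm);
            if (norm == 0.0) {
                break;
            }
            for (int a = 0; a < p; a++) {
                u[a] = v[a] / norm;
            }
        }
        return u;
    }

    // Score for a new input x: 1 - |cos| between the current direction and
    // the direction after adding x together with its duplicates.
    double score(double[] x) {
        if (count < 2) {
            throw new IllegalStateException("need at least two accepted instances");
        }
        double[] u = direction(count, sum, sumOuter);
        double[] s2 = sum.clone();
        double[][] so2 = new double[sumOuter.length][];
        for (int a = 0; a < so2.length; a++) {
            so2[a] = sumOuter[a].clone();
        }
        double weight = 1.0 + ratio * count; // x itself plus its duplicates
        accumulate(s2, so2, x, weight);
        double[] uTilde = direction(count + weight, s2, so2);
        double dot = 0.0;
        for (int a = 0; a < u.length; a++) {
            dot += u[a] * uTilde[a];
        }
        return 1.0 - Math.abs(dot);
    }

    // Returns true if x is flagged as an outlier; otherwise x updates the model.
    boolean process(double[] x) {
        boolean outlier = score(x) > threshold;
        if (!outlier) {
            addNormal(x);
        }
        return outlier;
    }
}
```

A caller would first feed the cleaned training instances through `addNormal`, then pass each new input to `process`; rejected inputs leave the running sums untouched.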
MODULE - III
Clustering
The training data are selected only by our assumption. So there is a
possibility that some outlier data were considered normal data in the previous
method because of our training data. The clustering method is used to solve
this problem: clusters are formed from the input data instances, and then the
outlier calculation is applied to each cluster to find the outliers exactly.
SYSTEM CONFIGURATION:-
HARDWARE CONFIGURATION:-
· Processor - Pentium IV
· Speed - 1.1 GHz
· RAM - 256 MB (min)
· Hard Disk - 20 GB
· Key Board - Standard Windows Keyboard
· Mouse - Two or Three Button Mouse
· Monitor - SVGA
SOFTWARE CONFIGURATION:-
· Operating System : Windows XP
· Programming Language : JAVA
· Java Version : JDK 1.6 & above.
REFERENCE:
Yuh-Jye Lee, Yi-Ren Yeh, and Yu-Chiang Frank Wang, “Anomaly Detection via
Online Over-Sampling Principal Component Analysis”, IEEE Transactions on
Knowledge and Data Engineering, 2013.