Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection
ABSTRACT:
In computer
forensic analysis, hundreds of thousands of files are usually examined. Much of
the data in those files consists of unstructured text, which is difficult for
computer examiners to analyze. In this context, automated
methods of analysis are of great interest. In particular, algorithms for
clustering documents can facilitate the discovery of new and useful knowledge from
the documents under analysis. We present an approach that applies document
clustering algorithms to forensic analysis of computers seized in police
investigations. We illustrate the proposed approach by carrying out extensive
experimentation with six well-known clustering algorithms (K-means, K-medoids, Single
Link, Complete Link, Average Link, and CSPA) applied to five real-world
datasets obtained from computers seized in real-world investigations.
Experiments have been performed with different combinations of parameters,
resulting in 16 different instantiations of algorithms. In addition, two
relative validity indexes were used to automatically estimate the number of
clusters. Related studies in the literature are significantly more limited than
our study. Our experiments show that the Average Link and Complete Link
algorithms provide the best results for our application domain. If suitably initialized,
partitional algorithms (K-means and K-medoids) can also yield very good
results. Finally, we also present and discuss several practical results that
can be useful for researchers and practitioners of forensic computing.
EXISTING SYSTEM:
Clustering algorithms are typically used for
exploratory data analysis, where there is little or no prior knowledge about
the data. This is precisely the case in several applications of Computer
Forensics, including the one addressed in our work. From a more technical
viewpoint, our datasets consist of unlabeled objects—the classes or categories
of documents that can be found are a priori unknown. Moreover, even assuming
that labeled datasets could be available from previous analyses, there is
almost no hope that the same classes (possibly learned earlier by a classifier
in a supervised learning setting) would be still valid for the upcoming data,
obtained from other computers and associated with different investigation
processes. More precisely, it is likely that the new data sample would come
from a different population. In this context, the use of clustering algorithms,
which are capable of finding latent patterns from text documents found in
seized computers, can enhance the analysis performed by the expert examiner. The
rationale behind clustering algorithms is that objects within a valid cluster
are more similar to each other than they are to objects belonging to a
different cluster. Thus, once a data partition has been induced from data, the
expert examiner might initially focus on reviewing representative documents
from the obtained set of clusters. Then, after this preliminary analysis, he or
she may eventually decide to scrutinize other documents from each cluster. By
doing so, the examiner can avoid the hard task of examining every document
individually, although that option remains available if desired.
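The partitional clustering idea described above can be sketched in a few lines of Java. This is a minimal, illustrative K-means (the document's partitional algorithm family); the class name, deterministic initialization, and toy data are our own assumptions, not the paper's implementation:

```java
import java.util.Arrays;

// Minimal K-means sketch; names and initialization scheme are illustrative only.
public class KMeans {

    // Returns a cluster label for each point; rows of `data` could be
    // feature vectors (e.g., term frequencies) extracted from seized documents.
    public static int[] cluster(double[][] data, int k, int maxIter) {
        int n = data.length, d = data[0].length;
        // Deterministic initialization: pick k evenly spaced points as centroids.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = data[c * n / k].clone();
        int[] labels = new int[n];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: each point goes to its nearest centroid
            // (squared Euclidean distance).
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int j = 0; j < d; j++) {
                        double diff = data[i][j] - centroids[c][j];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                if (labels[i] != best) { labels[i] = best; changed = true; }
            }
            // Update step: each centroid becomes the mean of its assigned points.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[labels[i]]++;
                for (int j = 0; j < d; j++) sums[labels[i]][j] += data[i][j];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int j = 0; j < d; j++) centroids[c][j] = sums[c][j] / counts[c];
            if (!changed) break;  // assignments are stable: converged
        }
        return labels;
    }

    public static void main(String[] args) {
        // Two well-separated toy groups standing in for document vectors.
        double[][] data = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };
        System.out.println(Arrays.toString(cluster(data, 2, 100)));  // prints [0, 0, 1, 1]
    }
}
```

As the existing-system discussion notes, the quality of such a partition depends heavily on initialization; the evenly spaced seeding here is only a placeholder for the more careful initializations evaluated in the study.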
DISADVANTAGES
OF EXISTING SYSTEM:
The literature on Computer Forensics only
reports the use of algorithms that assume that the number of clusters is known
and fixed a priori by the user. Aimed at relaxing this assumption, which
is often unrealistic in practical applications, a common approach in other
domains involves estimating the number of clusters from data.
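One common way to estimate the number of clusters from data is to score candidate partitions with a relative validity index and keep the value of k that maximizes it. The sketch below computes a simplified silhouette-style score in Java; the class, toy data, and candidate partitions are our own illustration, not the indexes or code used in the paper:

```java
// Simplified silhouette-style relative validity index; illustrative only.
public class Silhouette {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) { double diff = a[j] - b[j]; s += diff * diff; }
        return Math.sqrt(s);
    }

    // Mean silhouette width over all points; higher values indicate compact,
    // well-separated clusters. (Singleton clusters are handled loosely here.)
    public static double score(double[][] data, int[] labels, int k) {
        int n = data.length;
        double total = 0;
        for (int i = 0; i < n; i++) {
            double[] sum = new double[k];
            int[] cnt = new int[k];
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                sum[labels[j]] += dist(data[i], data[j]);
                cnt[labels[j]]++;
            }
            // a = mean distance to the point's own cluster;
            // b = smallest mean distance to any other cluster.
            double a = cnt[labels[i]] > 0 ? sum[labels[i]] / cnt[labels[i]] : 0;
            double b = Double.MAX_VALUE;
            for (int c = 0; c < k; c++)
                if (c != labels[i] && cnt[c] > 0) b = Math.min(b, sum[c] / cnt[c]);
            total += (b - a) / Math.max(a, b);
        }
        return total / n;
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };
        // Score two candidate partitions; the natural one should win.
        double good = score(data, new int[]{0, 0, 1, 1}, 2);
        double bad  = score(data, new int[]{0, 1, 0, 1}, 2);
        System.out.println(good > bad);  // prints true
    }
}
```

In practice one would run a clustering algorithm for a range of candidate k values, score each resulting partition with such an index, and select the k with the best score, which is the general strategy this section alludes to.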
PROPOSED SYSTEM:
Clustering algorithms have been studied for decades,
and the literature on the subject is huge. Therefore, we decided to choose a
set of six representative algorithms in order to show the potential of the
proposed approach, namely: the partitional K-means and K-medoids, the
hierarchical Single/Complete/Average Link, and the cluster
ensemble algorithm known as CSPA. These algorithms were run with different combinations
of their parameters, resulting in sixteen different algorithmic instantiations.
Thus, as a contribution of our work, we compare their relative performances on
the studied application domain—using five real-world investigation cases
conducted by the Brazilian Federal Police Department. In order to make the
comparative analysis of the algorithms more realistic, two relative validity indexes
have been used to estimate the number of clusters automatically from data.
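To make the hierarchical algorithms concrete, the following naive Average Link sketch in Java merges, at each step, the pair of clusters with the smallest average pairwise distance until the estimated number of clusters remains. It is our own illustrative code (cubic-time and unoptimized), not the authors' implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Naive Average Link agglomerative clustering; illustrative only.
public class AverageLink {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) { double diff = a[j] - b[j]; s += diff * diff; }
        return Math.sqrt(s);
    }

    // Starts with every point as its own cluster and repeatedly merges the
    // two clusters with the smallest average cross-cluster distance.
    public static List<List<Integer>> cluster(double[][] data, int k) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            List<Integer> c = new ArrayList<>();
            c.add(i);
            clusters.add(c);
        }
        while (clusters.size() > k) {
            int bestA = 0, bestB = 1;
            double best = Double.MAX_VALUE;
            for (int a = 0; a < clusters.size(); a++)
                for (int b = a + 1; b < clusters.size(); b++) {
                    // Average link: mean distance over all cross-cluster pairs.
                    double sum = 0;
                    for (int i : clusters.get(a))
                        for (int j : clusters.get(b)) sum += dist(data[i], data[j]);
                    double avg = sum / (clusters.get(a).size() * clusters.get(b).size());
                    if (avg < best) { best = avg; bestA = a; bestB = b; }
                }
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };
        System.out.println(cluster(data, 2));  // prints [[0, 1], [2, 3]]
    }
}
```

Single Link and Complete Link differ only in the merge criterion, replacing the average cross-cluster distance with the minimum or maximum pairwise distance, respectively.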
ADVANTAGES
OF PROPOSED SYSTEM:
Most importantly, we observed that clustering
algorithms indeed tend to induce clusters formed by either relevant or
irrelevant documents, thus contributing to enhance the expert examiner’s job.
Furthermore, our evaluation of the proposed approach in applications shows that
it has the potential to speed up the computer inspection process.
SYSTEM CONFIGURATION:-
HARDWARE CONFIGURATION:-
- Processor : Pentium IV
- Speed : 1.1 GHz
- RAM : 256 MB (min)
- Hard Disk : 20 GB
- Keyboard : Standard Windows Keyboard
- Mouse : Two- or Three-Button Mouse
- Monitor : SVGA
SOFTWARE CONFIGURATION:-
- Operating System : Windows XP
- Programming Language : Java
- Java Version : JDK 1.6 and above
REFERENCE:
Luís Filipe da Cruz Nassif and Eduardo Raul Hruschka, "Document Clustering for
Forensic Analysis: An Approach for Improving Computer Inspection," IEEE
Transactions on Information Forensics and Security, vol. 8, no. 1, January 2013.