Project description & progress report
AIMS AND STRATEGY FOR SCAMSEEK PROJECT

Background

The Scamseek project has a $2.2 million budget to build a surveillance tool for identifying financial scams on the Internet. The project has two phases. The first phase performs document classification of Internet pages. There are two principal types of document of concern: those in which unregistered advisors give financial advice, and those that promote illegal investment schemes. The system has two major features. Firstly, documents of known scams are analysed by linguists to identify the features that make them distinctive. Secondly, machine-learning strategies are used to analyse the documents, both to derive other features that may be useful in classification and to extract named entities. The results of the linguistic and machine-learning investigations will be combined to create a unified document classifier. The classifier will be fed by a web spider that searches the Internet for potential scam sites 24 hours a day, 7 days a week. Phase 2 is called RampAlert.

Components

The Scamseek project has been designed to consist of three components:

1. A Web Crawler System (WCS)
2. A Statistical Information Retrieval System (SIRS)
3. A Natural Language Analysis System (NaLAS)

Web Crawler System (WCS) - SMARTSEYE

This component is a process that searches the web for names of individuals and companies and retrieves the content surrounding each reference, such as a Web page, URL or document. This system supplies the primary data for the SIRS and NaLAS systems.

Statistical Information Retrieval System (SIRS)

This component is a process of document analysis and classification, which uses an extensive linguistic profile of the text in documents and a sophisticated classifier algorithm. SIRS classifies documents according to the classes required by the client. In the Scamseek project there are two types of classification under investigation.

The first is a linguistic classification, done both by hand and semi-automatically, to identify semantically significant indicators of each class of document. This work is performed by linguists. Once completed, it is used to annotate a sample of texts extensively, so that computational methods that simulate the linguists' expert judgement can be tested.

The second is machine classification, which, like the linguistic method, identifies keywords and phrases from texts of known classification. Here, however, the system operates by constructing a score for the document for each possible class using well-known classification algorithms. That is, the classifier or group of classifiers is trained from the data using a computational method. The classifier is then tested for its usefulness by asking it to classify documents it has not seen in its training phase, and its reliability is determined by calculating its precision and recall. The problematic parts of this task are to identify named entities in the texts, to decide the optimal combination of attributes to use in classification, and to determine an optimal ensemble of classifiers.
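
The train, score and test cycle described above can be illustrated with a small sketch. The report does not name the classification algorithms used in SIRS, so the multinomial naive Bayes scorer below is only an assumed stand-in for "well-known classification algorithms". The class labels ("scam", "other"), the toy training sentences and all function names are hypothetical and are not taken from the project.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokeniser: lower-case and split on whitespace."""
    return text.lower().split()


def train(labelled_docs):
    """Build per-class word counts from (text, label) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def class_scores(text, model):
    """Return one score per class (log prior plus smoothed log likelihoods)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return scores


def classify(text, model):
    """Assign the class with the highest score."""
    scores = class_scores(text, model)
    return max(scores, key=scores.get)


def precision_recall(predicted, actual, target):
    """Precision and recall for one target class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == target and a == target for p, a in pairs)
    fp = sum(p == target and a != target for p, a in pairs)
    fn = sum(p != target and a == target for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy training data, invented purely for illustration.
TRAINING = [
    ("guaranteed returns invest now risk free", "scam"),
    ("double your money guaranteed offshore scheme", "scam"),
    ("annual report audited results shareholders meeting", "other"),
    ("company announces quarterly results and dividend", "other"),
]
model = train(TRAINING)

# Held-out documents the classifier has not seen during training.
HELD_OUT = [
    ("risk free guaranteed scheme", "scam"),
    ("audited quarterly report", "other"),
]
predicted = [classify(text, model) for text, _ in HELD_OUT]
actual = [label for _, label in HELD_OUT]
print(precision_recall(predicted, actual, "scam"))
```

Precision here is the fraction of documents flagged as "scam" that really are scams, and recall is the fraction of real scams that were flagged. An ensemble, as mentioned above, could combine the per-class scores of several such classifiers before the highest-scoring class is chosen.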
The third phase is to coalesce the linguistic and machine classifier approaches to produce a single system of classification superior to either strategy alone. The SIRS project can be divided into six major phases:

Natural Language Analysis System (NaLAS)

The output of the SIRS system can be used directly as input to the next level of processing, NaLAS. The function of NaLAS is to analyse the documents in detail to identify specific characteristics such as the who, when, where and why of the content. NaLAS is expected to commence on completion of the SIRS system, should its performance prove satisfactory.

Partners: