PROJECT DESCRIPTION AND PROGRESS REPORT
SCAMSEEK PROJECT SIRS COMPONENT

What is Scamseek?

Scamseek is a composite of three sub-systems built to retrieve candidate financial scam documents from the internet (Smartseye), identify high-likelihood scam documents in that collection (SIRS), and extract key information such as who, when, where and what from those documents (NaLAS).

Smartseye is a meta search engine that uses a variety of search engines to crawl the internet and retrieve potential scam sites. It is provided by the CMCRC partner SMARTS. SIRS, the sub-system that we built, retrieves documents from identified sites and classifies them by likely illegality and type of scam. NaLAS (in the planning stage) will extract from likely illegal documents more detail about the organisers of a scam and the when, where and what of their scheme.


What does SIRS do?

  • Classify documents into one of three classes.
  • Classify scams into one of three classes of Scamtype: Unlicensed Investment Advice, Unlawful Fundraising, Share Ramping.

How does SIRS achieve its goals?

SIRS uses two computational strategies to achieve its goals: SHALLOW processing and DEEP processing. Shallow processing follows the traditional strategies of document classification, which treat a document as a "bag of words". Deep processing consists of analysis of the texts by linguists, who identify the meaning features of the texts that are particular to each class of document.

How is the project organised?

We have 4 teams with different skills and responsibilities:

  • Domain Specialist Team (DST): ASIC's users
  • Linguistics Team (LT)
  • Computational Linguists Team (CLT)
  • Software Engineering Team (SET)

The teams have defined roles:

The Linguistics Team (LT) has the task of dealing with the Domain team, identifying the characteristics of the scam sites and documents, and understanding the nature of the raw data. Its work is to scrutinise the texts of each type, using a trial and error approach to assess the value of any meaning features for classification. Linguistics team members use a variety of software to assist this task, some of which is developed in-house as a need is identified. They then annotate the text with these features, aided by markup software, in preparation for the experimental program.

The results of their feature selection are passed to the Computational Linguistics Team (CLT), which has the task of automating the identification of the meaning features. This task is problematic because lexical items, which are simple to identify computationally, usually have ambiguous meanings. The computational task must resolve the ambiguities so as to extract only the meanings identified as significant by the linguists. Once these features are extracted reliably, they are either checked directly for classification reliability or combined with other features to determine their usefulness alongside the shallow analysis results.

Once the Computational Linguists have completed their task of identifying the optimal features, tokenisation, attribute formation and machine learning, they pass the results on to the Software Engineering Team (SET). This team has the job of developing a production system that is operational in two senses. Firstly, the system has to be compatible with the operating needs of the Domain Specialists, both at the machine level and at the day-to-day operational level. Secondly, the system must replicate the results of the CLT in code that satisfies industrial-quality engineering standards of best practice.


What is the Software Solution?

The software solution has 3 different systems, which are related but independent. They are:

  1. Linguists Investigation Workbench (LIW)
  2. Classifier Development Workbench (CDW)
  3. Classifier Production System (CPS)

LIW is a collection of programs that allow the linguists to perform two tasks:

  • Task 1: Peruse the texts to explore the frequencies of interesting features.
  • Task 2: Mark up text with features of interest. This task could use a number of different software systems that provide for manual and semi-automatic mark-up of particular features.

The CLT has responsibility for maintenance and support of the LIW.

CDW is a workbench with a large variety of software drawn from any useful source and put together by the CLT. It is not necessarily coherent in architecture or code, in the sense that any freely available processing system is exploited for experimentation. Two immediate sources of programs are Postgres for database storage and the WEKA system for machine learning algorithms. These sources are not necessarily used exclusively.

The CDW also includes the programs and scripts created to extract particular semantic features.

The CPS is a carefully crafted and engineered piece of software that constitutes the client's working system. We have used open source software in developing the CPS because that software has been well engineered and provides a platform architecture with some service processing functions that we do not have to create for ourselves.

We also know that this approach significantly lowers the project risk, both for engineering criteria and for the completion date.

What is the Work we do?

SIRS uses a combination of traditional and innovative methods of automated document classification. The traditional Shallow method uses supervised machine learning. In this method a set of documents is labelled manually by class type. The documents are then taken as a collection and the frequency of words in each document is counted. The counts become features of the documents and are given to a machine learner to determine the optimal features and feature values for predicting each class of document. This technique is known as "bag of words" because the meanings of the words and the relationships of the words to each other are ignored. Because of the crudeness of this approach, practical applications require a great deal of experimentation to determine the best machine learner and the best combination of features for a given corpus of documents, as well as traditional computational linguistic parameters such as tokenisation strategies, stop list composition, frequency counting and strategic probability smoothing.
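
The Shallow method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SIRS implementation: the learner is a multinomial naive Bayes classifier chosen only for brevity, and the stop list, the training texts and the labels "scam" and "legitimate" are invented for the example. It shows tokenisation, stop list filtering, frequency counting and add-alpha probability smoothing.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop list; a real one would be tuned by experiment.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for",
              "your", "our", "on"}

def tokenise(text):
    """Lowercase, keep runs of letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS]

def train(labelled_docs, alpha=1.0):
    """Count word frequencies per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for text, label in labelled_docs:
        tokens = tokenise(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return {"words": word_counts, "docs": doc_counts,
            "vocab": vocab, "alpha": alpha}

def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, n_docs in model["docs"].items():
        counts = model["words"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # class prior
        for token in tokenise(text):
            if token not in model["vocab"]:
                continue  # ignore words never seen in training
            # Add-alpha smoothing avoids zero probabilities.
            score += math.log((counts[token] + alpha)
                              / (total_words + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy corpus, for illustration only.
TRAINING = [
    ("Guaranteed returns, invest now, risk free profit", "scam"),
    ("Double your money with this guaranteed investment", "scam"),
    ("Annual report of audited results for shareholders", "legitimate"),
    ("The prospectus is lodged with the regulator for review", "legitimate"),
]
model = train(TRAINING)
label = classify(model, "guaranteed profit, invest today")
```

Word order and meaning play no part in the score, which is the sense in which the document is treated as a "bag of words". Changing the tokeniser, the stop list or the smoothing constant alpha changes the result, which is why these parameters have to be explored experimentally.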

Our innovation is to introduce linguistic, or Deep, analysis of the documents into the classification process. In this approach our linguists analyse the documents to identify meaning features in the texts that are representative of each class. Because the linguists provide an explanation for each feature, the features can potentially be used across multiple languages.

The principal disadvantage is that the same meaning can be represented in many word and phrase forms, and we have to be able to compute all those forms. We are often asked what happens when scamsters change their texts. This can be thought of as moving the goalposts: the scamster endeavours to move the goalposts from time to time, and we have to predict the space within which they can be moved. Once the goalposts are moved outside this space, the scamster is no longer playing the scam.

Once the Linguists have established a prima facie case for using a particular meaning feature, the Computational Linguists have to devise a method of computing it reliably and then run a series of experiments to establish that the feature is indeed effective at contributing to scam recognition.
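
One simple way to screen a candidate feature before running full classification experiments is to measure how much it reduces uncertainty about the class label. The Python sketch below computes information gain for a single feature. It is offered as an assumption about what such a check could look like, not as the measure the project used, and the feature values and labels are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy from knowing the feature value."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Invented data: 1 means the meaning feature was found in the document.
labels = ["scam", "scam", "legitimate", "legitimate"]
useful_feature = [1, 1, 0, 0]
useless_feature = [1, 0, 1, 0]

gain_useful = information_gain(useful_feature, labels)
gain_useless = information_gain(useless_feature, labels)
```

A feature that separates the classes perfectly scores the full entropy of the labels, and one that is unrelated to the class scores zero. A high score on its own does not show that the feature helps in combination with the shallow features, so classification experiments are still needed.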

How far have we got?

Staffing

We have 1 Project Director, 5 Doctoral Scholars, 2 Linguists, 2 Computational Linguists, 3 Software Engineers and 5 Research Assistants.

Systems

SMARTS has delivered Smartseye. ScamAlert has been delivered to ASIC and is in operation. Phase 2, a system for hunting out share ramping on bulletin boards and in chat rooms, is under development and will be delivered by 30 June 2004.