Information extraction from Web pages

Ashley S James

Supervised by Andrew C Jones, Moderated by Richard Booth

In this project you will develop software to extract information from HTML pages in an agreed application domain. I would be particularly interested in software designed for Web pages that supply biodiversity information. You will design templates that define the kinds of information to be sought, and techniques for locating this information within a variety of Web pages. The more generic your software is, the better. The implementation language is up to you.

If done as an MSc project, you will be expected to assess the feasibility of information extraction from HTML pages, compare it with the more "Semantic" approaches now available using XML, ontologies, etc., and discuss this in your thesis. For MSc Strategic Information Systems, you would need to give particular attention to the requirements and needs in your chosen scenario. An ambitious version of this project would implement a prototype that allows you to explore and compare these approaches.

If done as an Undergraduate project, you could apply your chosen specialisation to the project (where relevant). For example, for those with a Distributed and Mobile Systems specialism, it would be worthwhile to consider how an information extraction facility could be delivered to a mobile phone, or even run from a mobile phone (in which case bandwidth and processing power will be issues to consider).
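
To make the idea of an extraction template more concrete, the following is a minimal sketch in Python, using only the standard library. It is an illustration under stated assumptions, not a prescribed design: the template format, the field names, the `extract` function and the sample page are all hypothetical. The template maps each field to the labels that might introduce it on a page, and the locating technique is deliberately simple: find a text fragment that matches a label, then take the text that follows it.

```python
from html.parser import HTMLParser

# Hypothetical template: field name -> labels that may introduce it on a page.
SPECIES_TEMPLATE = {
    "scientific_name": ["scientific name", "binomial name"],
    "family": ["family"],
    "conservation_status": ["conservation status", "iucn status"],
}


class TextChunks(HTMLParser):
    """Collect the visible text fragments of a page in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and not self._skip:
            self.chunks.append(text)


def extract(html, template):
    """Return {field: value} for every template field located in the page."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    chunks = parser.chunks
    result = {}
    for field, labels in template.items():
        for i, chunk in enumerate(chunks):
            lowered = chunk.lower()
            for label in labels:
                if lowered.rstrip(": ") == label:
                    # Label sits in its own element (e.g. <th> or <dt>);
                    # assume the value is the next text fragment.
                    if i + 1 < len(chunks):
                        result[field] = chunks[i + 1]
                    break
                if lowered.startswith(label + ":"):
                    # Label and value share one fragment: "Label: value".
                    value = chunk[len(label) + 1:].strip()
                    if value:
                        result[field] = value
                    break
            if field in result:
                break
    return result


# Hypothetical sample page mixing a table layout with inline "Label: value" text.
SAMPLE = """
<html><head><style>td {color: red}</style></head><body>
<table>
<tr><th>Scientific name</th><td><i>Panthera leo</i></td></tr>
<tr><th>Family:</th><td>Felidae</td></tr>
</table>
<p>Conservation status: Vulnerable</p>
</body></html>
"""

if __name__ == "__main__":
    print(extract(SAMPLE, SPECIES_TEMPLATE))
```

Because the template is plain data, the same code could be pointed at a different domain by swapping in another dictionary, which is one possible reading of "generic" here. A real system would probably need stronger locating techniques, since this label-then-next-fragment heuristic fails on pages where the value precedes its label or is expressed only in running prose.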

