Title: Assessing Compliance of Web Pages using Machine Learning
The project will deliver an application capable of crawling a given set of web domains (100+) to find pages that display compliance-related data and to categorise those pages as compliant or non-compliant, using a combination of machine learning and rule-based approaches. Features for classification will be extracted from the web pages using natural language processing (NLP); in particular, named entity recognition and basic information extraction are expected to be used. The main concepts in the web pages will be formally modelled in a small ontology to support the semantic elements of the NLP. The potential benefit of such a system is a dramatic reduction in the manual workload of ensuring that disparate organisations display data to the required standard.
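The combination of rule-based and learned classification described above might be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: the regular expressions stand in for named entity recognition, the weights are hand-set placeholders for a trained model, and every name (`PATTERNS`, `WEIGHTS`, `classify`, and so on) is hypothetical. Crawling and the ontology are left out.

```python
import math
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_text(html):
    """Return the page's text with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Hypothetical patterns standing in for named entity recognition.
PATTERNS = {
    "has_date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "has_money": re.compile(r"£\s?\d[\d,]*(?:\.\d+)?"),
    "has_policy_term": re.compile(r"\b(?:policy|statement|report)\b", re.I),
}

# Placeholder weights; a real system would learn these from labelled pages.
WEIGHTS = {"has_date": 1.5, "has_money": 2.0, "has_policy_term": 0.5}
BIAS = -2.0


def extract_features(text):
    """Map text to binary features, one per pattern."""
    return {name: 1.0 if pat.search(text) else 0.0
            for name, pat in PATTERNS.items()}


def classify(html):
    """Label a page using a hard rule first, then a logistic score."""
    features = extract_features(page_text(html))
    # Rule-based stage (assumed rule): a page with no monetary figure
    # is labelled non-compliant without consulting the scored stage.
    if features["has_money"] == 0.0:
        return "non-compliant"
    # Scored stage: logistic function over the weighted features.
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    probability = 1.0 / (1.0 + math.exp(-z))
    return "compliant" if probability >= 0.5 else "non-compliant"


if __name__ == "__main__":
    sample = ("<html><body><h1>Pay policy statement</h1>"
              "<p>Published 01/04/2015. Salary: £120,000.</p></body></html>")
    print(classify(sample))
```

Putting the rule before the scored stage keeps hard requirements auditable, while the weighted score handles cases where no single rule is decisive.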
Deliverables: Final report
Student: Christopher Green
Supervisor: Irena Spasic
Moderator: Helen R Phillips
Report: Archive