University of Glamorgan


Hypermedia Research Unit

Andronikos - Semantic Indices Portal

Andronikos is a web-portal closely associated with the STAR project. It makes available the semantic indices of the OASIS corpus, which was provided by the Archaeology Data Service. These semantic indices were produced by the STAR NLP Information Extraction phase using the GATE tool, driven by the English Heritage glossaries and thesauri and by the archaeological extension of the CIDOC CRM core ontology (the CRM-EH).

The Information Extraction phase produces annotations in two formats: as XML files coupled with content, and as RDF triples decoupled from content. Both are based on a CRM-EH RDF representation of some key grey literature concepts.
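The difference between the two formats can be illustrated with a minimal Python sketch. It is not the STAR implementation: the element names, the `concept` attribute, the `ex:` and `thes:` prefixes and the `to_triples` helper are all hypothetical, chosen only to show the same annotations once inline in the text and once as standalone triples.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline-annotated fragment (annotations coupled with content).
# Tag names, attributes and prefixes are illustrative, not the real schema.
ANNOTATED = (
    '<doc id="report-1">A <Find concept="thes:pottery">sherd of pottery</Find> '
    'was recovered from the <Context concept="thes:ditch">ditch</Context>.</doc>'
)

def to_triples(xml_text):
    """Turn inline annotations into (subject, predicate, object) tuples
    that no longer carry the surrounding document text."""
    root = ET.fromstring(xml_text)
    doc_id = root.get("id")
    triples = []
    for n, elem in enumerate(root, start=1):
        subject = f"{doc_id}#ann{n}"
        triples.append((subject, "rdf:type", f"ex:{elem.tag}"))
        triples.append((subject, "ex:concept", elem.get("concept")))
        triples.append((subject, "ex:note", elem.text))
    return triples

if __name__ == "__main__":
    for triple in to_triples(ANNOTATED):
        print(triple)
```

The XML keeps each annotation in its sentence, which suits reading in context, while the triples can be loaded into a store and queried alongside other data.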

The representation as RDF triples enables cross-search of both datasets and OASIS reports in the STAR Demonstrator. The XML files show the annotations in the context of the original grey literature documents and provide a fuller set of associated information.

The main objective of the web-portal development is to make the semantic indices available and to present the resulting semantic annotation files as HTML hypertext documents, together with overview statistics of the annotations abstracted from the XML documents. The portal also hosts a set of thesauri overlapping tools and evaluation results. These are not publicly accessible, but access may be given upon request to avlachid (at) glam.ac.uk.

The portal makes available HTML pages that demonstrate the abstractions produced during the three main phases of the Information Extraction process. The three phases form an extraction pipeline that runs in cascading order: each phase produces an extraction result that is used as input by the succeeding phase. In detail, the three extraction phases are:

Preprocessing: aims to extract heading and tabular phrases, as well as summary document sections.

EH Knowledge Resources Lookup (SKOS-oriented extraction): aims to generate Lookup annotations that are aligned to the CIDOC-CRM ontology and are based on gazetteer entries. The gazetteers accommodate EH thesauri and glossary listings, which are defined to support the SKOS-enabled JAPE rules used for expansion and disambiguation.

Lookup Synthesis (CRM-EH semantic annotations): generates Annotation Types aligned to the CRM-EH ontology. Rules examine the connections between the CIDOC-CRM Lookup annotations that were generated previously. The phase re-annotates existing CIDOC-CRM Lookup annotations with specialised CRM-EH entities and creates a new ‘wrapping’ annotation, which is also aligned to CRM-EH.
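The cascading order of the three phases can be sketched in a few lines of Python. This is a toy stand-in, not the GATE/JAPE pipeline: the gazetteer entries, the type labels (`PhysicalObject`, `Place`, `Find`, `Context`, `FindInContext`) and all function names are assumptions made for illustration, and the synthesis rule (an object and a place in the same sentence) is far simpler than the real rules.

```python
import re

# Hypothetical gazetteer: term -> broad type (stand-in for thesaurus entries).
GAZETTEER = {"pottery": "PhysicalObject", "ditch": "Place", "pit": "Place"}

def preprocess(text):
    """Phase 1 stand-in: split the text into sentence-like segments."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def lookup(segments):
    """Phase 2 stand-in: annotate gazetteer terms found in each segment."""
    annotations = []
    for i, seg in enumerate(segments):
        for term, kind in GAZETTEER.items():
            pattern = rf"\b{re.escape(term)}\b"
            for m in re.finditer(pattern, seg, re.IGNORECASE):
                annotations.append({"segment": i, "start": m.start(),
                                    "end": m.end(), "text": m.group(),
                                    "type": kind})
    return sorted(annotations, key=lambda a: (a["segment"], a["start"]))

def synthesise(segments, lookups):
    """Phase 3 stand-in: where an object and a place share a segment,
    re-annotate both with a specialised type and add a wrapping annotation."""
    results = []
    for i, seg in enumerate(segments):
        here = [a for a in lookups if a["segment"] == i]
        objs = [a for a in here if a["type"] == "PhysicalObject"]
        places = [a for a in here if a["type"] == "Place"]
        if objs and places:
            obj, place = objs[0], places[0]
            start = min(obj["start"], place["start"])
            end = max(obj["end"], place["end"])
            results.append({**obj, "type": "Find"})
            results.append({**place, "type": "Context"})
            results.append({"segment": i, "start": start, "end": end,
                            "text": seg[start:end], "type": "FindInContext"})
    return results

def run_pipeline(text):
    """Each phase consumes the output of the phase before it."""
    segments = preprocess(text)
    lookups = lookup(segments)
    return synthesise(segments, lookups)

if __name__ == "__main__":
    for ann in run_pipeline("Pottery was found in the ditch. The pit was empty."):
        print(ann["type"], "->", ann["text"])
```

In this sketch the second sentence yields a Lookup annotation for "pit" but no synthesised annotation, because the synthesis rule only fires when both annotation kinds occur together.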

The portal uses server-side PHP to handle the annotations from the XML files and to generate the relevant web pages. It uses DOM XML to process the XML files and reveal the annotations of documents, and it integrates with a MySQL database server. In addition, a search engine indexing algorithm provided by the open source FDSE project was deployed in the portal to index both the web pages of the semantic annotations and the full-text version pages. The search engine retrieves results from both indices so that their ability to respond to common search queries can be visually inspected.
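The general idea of walking an annotated XML document, emitting HTML and collecting overview counts can be sketched as follows. The portal itself does this in PHP; the Python below is only an analogous illustration, and the element names, the `render` helper and the choice of `<span>` markup are assumptions rather than the portal's actual output.

```python
import html
import xml.etree.ElementTree as ET
from collections import Counter

def render(xml_text):
    """Return (html_fragment, counts) for a flat annotated document.

    Each child element becomes a <span> whose class is the annotation type;
    plain text between annotations is escaped and kept in place.
    """
    root = ET.fromstring(xml_text)
    parts = [html.escape(root.text or "")]
    counts = Counter()
    for elem in root:
        counts[elem.tag] += 1
        parts.append(
            f'<span class="{html.escape(elem.tag)}">'
            f'{html.escape(elem.text or "")}</span>'
        )
        parts.append(html.escape(elem.tail or ""))
    return "<p>" + "".join(parts) + "</p>", counts

if __name__ == "__main__":
    sample = "<doc>A <Find>sherd</Find> from the <Context>ditch</Context>.</doc>"
    page, counts = render(sample)
    print(page)
    print(dict(counts))
```

The counts returned alongside the HTML correspond to the kind of per-document overview statistics described above. A database or search index would sit behind this step and is left out of the sketch.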

Andronikos web-portal


Pontypridd, CF37 1DL, UK.

© University of Glamorgan