I am interested in distributed systems in general, and distributed storage, peer-to-peer systems and middleware in particular.

I am involved in the ESRC-funded Digitising Scotland project, which aims to construct a linked genealogy of Scottish historical records, with Chris Dibben, Lee Williamson, Zhiqiang Feng and Zengyi Huang at Edinburgh, and Alan Dearle, Özgür Akgün and Tom Dalton in Computer Science at St Andrews. So far we have focused on automatic classification of certain fields within the records (cause of death and occupation); now we are starting to experiment with various probabilistic linkage approaches. This work also includes Eilidh Garrett and Alice Reid at Cambridge, and Peter Christen at ANU.

I previously led a work package on linkage methodology within the ESRC-funded Administrative Data Research Centre - Scotland, with Alan Dearle, Özgür Akgün, Peter Christen and Alasdair Gray at Heriot-Watt.

I supervise Tom Dalton, who is doing his PhD on handling uncertainty in data linkage, with a focus on using synthetic population-scale data to evaluate population linkage approaches.

Possible PhD Projects

Analysis and Linking of Large-Scale Genealogical Datasets

This project investigates methods for analysing and linking large sets of individual genealogical records. The core problem is to take a set of digitised records and, from these, create a set of inter-linked pedigrees for the population. This specification can usefully be refined by introducing provenance and confidence: rather than producing a single set of pedigrees, the process produces a set of potential pedigrees, along with evidence for the relationships within them. From such a representation it is possible to project out various specific sets of pedigrees by defining appropriate criteria.

We thus need a computational process that will give not only definite results, such as X is the mother of Y, but also qualified ones: that X may be the mother of Y based on information P, Q and R, and that Z may be the mother of Y based on S, T and V. This will provide a richer information source for future research, and requires new ways of approaching the problem. In the past, algorithms have been run on data sets to produce definitive pedigrees. We propose to attack the problem at a meta level, with a reasoning engine that produces not only results but also the reasons for those results.
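
As a very rough illustration (a Python sketch, with all names, evidence identifiers and confidence figures invented), candidate relationships might be stored together with their supporting evidence and a confidence estimate, and specific sets of pedigrees projected out by applying criteria such as a confidence threshold:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CandidateLink:
        """A possible relationship between two records, retained
        alongside the evidence that supports it."""
        relation: str      # e.g. "mother-of"
        subject: str       # record identifier for X or Z
        obj: str           # record identifier for Y
        evidence: tuple    # supporting records, e.g. ("P", "Q", "R")
        confidence: float  # estimated probability the link is correct

    def project(links, min_confidence):
        """Project out one concrete set of links by applying a criterion;
        other criteria could be composed in the same way."""
        return [l for l in links if l.confidence >= min_confidence]

    links = [
        CandidateLink("mother-of", "X", "Y", ("P", "Q", "R"), 0.9),
        CandidateLink("mother-of", "Z", "Y", ("S", "T", "V"), 0.4),
    ]
    for l in project(links, min_confidence=0.5):
        print(l.subject, l.relation, l.obj, "given", l.evidence)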

For flexibility and long-term usefulness, we do not envisage a one-off process in which a set of records is fed into an algorithm and a set of pedigrees is output. Instead, a continuous process will accept an indefinite stream of records to feed into and refine an established knowledge base. Similarly, the set of rules that govern the relationships between records may be refined and evolved over time. This evolutionary approach yields a highly flexible knowledge and reasoning engine; however, allowing rules and inferred relationships to change carries the danger of information being lost during the process. To prevent this, we propose a non-destructive append-only approach to the storage of data and meta-data.
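
A minimal sketch of the append-only idea (Python; the interface is hypothetical): nothing is ever overwritten or deleted, a revised inference is recorded as a new entry that supersedes an earlier one, and the current view is derived by replaying the log while the full history remains available:

    import itertools

    class AppendOnlyStore:
        """Assertions are only ever appended; revision is expressed
        as supersession rather than deletion."""
        def __init__(self):
            self._log = []                 # complete, immutable history
            self._ids = itertools.count()

        def assert_fact(self, fact):
            entry_id = next(self._ids)
            self._log.append(("assert", entry_id, fact))
            return entry_id

        def supersede(self, old_id, fact):
            # The old entry stays in the log; only the derived view changes.
            new_id = self.assert_fact(fact)
            self._log.append(("supersede", old_id, new_id))
            return new_id

        def current_view(self):
            dead = {x for op, x, _ in self._log if op == "supersede"}
            return [(i, f) for op, i, f in self._log
                    if op == "assert" and i not in dead]

    store = AppendOnlyStore()
    a = store.assert_fact(("mother-of", "X", "Y"))
    store.supersede(a, ("mother-of", "Z", "Y"))  # earlier inference still recoverable
    print(store.current_view())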

We hope to evaluate the techniques developed using the full set of birth, death and marriage records for Scotland from 1850 to the present. A preliminary phase of the project is now under way. [With Alan Dearle]

Towards Pervasive Personal Data

This project will investigate techniques for making file data pervasively available across all the storage resources accessible to an individual or an organisation. These may encompass personal computers and mobile devices, machines available within a work environment, and commercial online services. The envisioned infrastructure observes file changes occurring in any particular storage location, and automatically propagates those changes to all other appropriate locations, similarly to existing services such as Windows Live Mesh, Dropbox and SugarSync, but with the following differences: it

  • does not rely on external services
  • observes and exploits variations in machine and network capabilities
  • exploits disparate storage facilities, including those provided by machines not exclusively under the user’s control
  • exploits disparate data transfer mechanisms, including physical movement of passive devices
  • supports high-level user policies governing data placement and synchronization strategy

The main challenges in implementing such an infrastructure lie in detecting and exploiting patterns in resource availability; planning, executing and monitoring good routes and schedules for data propagation; and supporting users in visualising the current system state and its relation to the goals of their high-level policies.
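
As a rough sketch of the observation and planning steps (Python; the directory paths are invented), each storage location can be summarised by content hashes, and a propagation plan derived by comparing the summaries of two replicas; scheduling the resulting transfers over particular links or devices would then be driven by the user's policies:

    import hashlib
    import os

    def summarise(root):
        """Map each relative path under root to a digest of its contents."""
        summary = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                summary[os.path.relpath(path, root)] = digest
        return summary

    def plan_propagation(source, target):
        """Paths whose digests differ at, or are absent from, the target."""
        return [p for p, d in source.items() if target.get(p) != d]

    # e.g. plan_propagation(summarise("/laptop/docs"), summarise("/nas/docs"))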

Resource-Aware Distributed Databases

The aim of this project is to develop further Angus Macdonald’s PhD work on workstation-based distributed databases, which investigated the issues in providing ACID semantics and automatic backup on a constantly shifting pool of non-dedicated machines within an organisation. Areas with potential for further research include autonomic optimisation of the placement of data and computation, the relaxation of strict consistency rules, and application to other database models. [With Alan Dearle]
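
One way to frame the autonomic placement question, sketched in Python (the machine properties and weights below are invented): periodically score the non-dedicated machines currently in the pool and keep replicas on the highest-scoring ones:

    def placement_score(machine, weights=(0.6, 0.3, 0.1)):
        """Combine observed workstation properties into one score; the
        weighting here is arbitrary and could itself be tuned autonomically."""
        w_avail, w_space, w_load = weights
        return (w_avail * machine["availability"]      # fraction of time online
                + w_space * machine["free_fraction"]   # fraction of disk free
                - w_load * machine["cpu_load"])        # recent load average

    pool = [
        {"name": "ws-01", "availability": 0.95, "free_fraction": 0.40, "cpu_load": 0.2},
        {"name": "ws-02", "availability": 0.60, "free_fraction": 0.80, "cpu_load": 0.1},
        {"name": "ws-03", "availability": 0.99, "free_fraction": 0.05, "cpu_load": 0.7},
    ]
    replicas = sorted(pool, key=placement_score, reverse=True)[:2]
    print([m["name"] for m in replicas])  # the two preferred replica hosts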

Testing Distributed Software

The PlanetLab development platform allows the developer of distributed software to test it under real wide-area network conditions. However, there are significant administrative overheads involved in participation. This project would investigate the issues involved in developing a testbed for distributed applications, running on a number of local clusters, and allowing the experimenter to simulate wide-area latencies, bandwidths and machine failures. Such a testbed would need to automate the execution of repeatable experiments, and the recording and analysis of measurements. Aspects to be investigated include the system properties that can be measured, the accuracy to which real-world behaviour can be predicted, and the ease of use of the testbed.
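
A toy illustration of the latency-simulation aspect (Python; the site names and figures are invented): assign each pair of clusters a nominal wide-area latency, delay message delivery accordingly, and seed the randomness so that an experiment can be re-run under identical conditions:

    import random
    import time

    # Hypothetical one-way latencies, in milliseconds, between local
    # clusters standing in for wide-area sites.
    LATENCY_MS = {("st-andrews", "sydney"): 160, ("st-andrews", "boston"): 45}

    def simulated_send(src, dst, payload, rng):
        base = LATENCY_MS.get((src, dst)) or LATENCY_MS.get((dst, src), 1)
        jitter = rng.gauss(0, base * 0.1)            # modest variability
        time.sleep(max(base + jitter, 0) / 1000.0)   # impose the artificial delay
        return payload                               # loss/failure could be injected here

    rng = random.Random(42)                          # fixed seed: repeatable experiment
    start = time.monotonic()
    simulated_send("st-andrews", "sydney", b"ping", rng)
    print("delivered after %.1f ms" % ((time.monotonic() - start) * 1000))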

Previous PhD Students
