I am interested in distributed systems in general, and distributed storage, peer-to-peer systems and middleware in particular.
I am involved in the ESRC-funded Digitising Scotland project, which aims to construct a linked genealogy of Scottish historical records, with Chris Dibben, Lee Williamson, Zhiqiang Feng and Zengyi Huang at Edinburgh, and Alan Dearle, Özgür Akgün and Tom Dalton in Computer Science at St Andrews. So far we have focused on automatic classification of certain fields within the records (cause of death and occupation); now we are starting to experiment with various probabilistic linkage approaches. This work also includes Eilidh Garrett and Alice Reid at Cambridge, and Peter Christen at ANU.
I previously led a work package on linkage methodology within the ESRC-funded Administrative Data Research Centre - Scotland, with Alan Dearle, Özgür Akgün, Peter Christen and Alasdair Gray at Heriot-Watt.
- DINA 2019: 5th Workshop on Data Integration and Applications
- PAAP 2019: 10th International Symposium on Parallel Architectures, Algorithms and Programming
- PDCAT 2019: 20th International Conference on Parallel and Distributed Computing, Applications and Technologies
- IEEE ICDM DINA’18: 4th Workshop on Data Integration and Applications
- PDCAT 2016: 17th International Conference on Parallel and Distributed Computing, Applications and Technologies
- IEEE ICDM DINA’16: 3rd Workshop on Data Integration and Applications
- PAAP 2015: 7th International Symposium on Parallel Architectures, Algorithms and Programming
- IEEE ICDM DINA’15: 2nd Workshop on Data Integration and Applications
- IEEE ICDM DINA’14: Workshop on Data Integration and Applications
- PDCAT 2014: 15th International Conference on Parallel and Distributed Computing, Applications and Technologies
- PAAP 2014: 6th International Symposium on Parallel Architectures, Algorithms and Programming
- IEEE International Conference on Awareness Science and Technology, 2012, 2013
- IFIP International Conference on Network and Parallel Computing, 2012
- IEEE Consumer Communications and Networking Conference, 2004, 2005, 2006
Possible PhD Projects
Analysis and Linking of Large-Scale Genealogical Datasets
This project investigates methods for analysing and linking large sets of individual genealogical records. The core problem is to take a set of digitised records, and from this create a set of inter-linked pedigrees for the population. This specification may be usefully refined with the introduction of provenance and confidence, so that rather than producing a single set of pedigrees, a set of potential pedigrees is produced along with evidence for the relationships within them. From such a representation it is possible to project out various specific sets by defining appropriate criteria.
We thus need a computational process that will not only give results such as X is the mother of Y, but also that X may be the mother of Y based on information P,Q,R, and that Z may be the mother of Y based on S,T,V. This will provide a richer information source for future research, and requires new ways of approaching the problem. In the past, algorithms have been run on data sets to produce definitive pedigrees. We propose to attack this at a meta level with a reasoning engine that will not only produce results but also reasons for those results.
For flexibility and long-term usefulness, we do not envisage a one-off process in which a set of records is fed into an algorithm and a set of pedigrees is output. Instead, a continuous process will accept an indefinite stream of records to feed into and refine an established knowledge base. Similarly, the set of rules that govern the relationships between records may be refined and evolved over time. This evolutionary approach yields a highly flexible knowledge and reasoning engine; however, allowing rules and inferred relationships to change carries the danger of information being lost during the process. To prevent this, we propose a non-destructive append-only approach to the storage of data and meta-data.
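The append-only idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the project's actual design: the `Assertion` schema, the `AppendOnlyStore` class, and the numeric confidence scores are all hypothetical. The sketch records competing candidate relationships together with their evidence, never overwrites earlier entries, and projects out a subset by a confidence criterion.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Assertion:
    """A candidate relationship plus its supporting evidence (hypothetical schema)."""
    relation: str              # e.g. "mother_of"
    subject: str               # e.g. "X"
    obj: str                   # e.g. "Y"
    evidence: Tuple[str, ...]  # identifiers of the supporting records or rules
    confidence: float          # illustrative score in [0, 1], not a real measure


class AppendOnlyStore:
    """Assertions are only ever appended; nothing is updated or deleted."""

    def __init__(self) -> None:
        self._log: List[Assertion] = []

    def append(self, assertion: Assertion) -> int:
        self._log.append(assertion)
        return len(self._log) - 1  # the log position identifies the entry

    def candidates(self, relation: str, obj: str) -> List[Assertion]:
        """All recorded candidates for a relationship, strongest first."""
        matches = [a for a in self._log
                   if a.relation == relation and a.obj == obj]
        return sorted(matches, key=lambda a: a.confidence, reverse=True)

    def project(self, min_confidence: float) -> List[Assertion]:
        """Project out the assertions that meet a confidence criterion."""
        return [a for a in self._log if a.confidence >= min_confidence]


store = AppendOnlyStore()
# "X may be the mother of Y based on P, Q, R" (made-up confidence value)
store.append(Assertion("mother_of", "X", "Y", ("P", "Q", "R"), 0.8))
# "Z may be the mother of Y based on S, T, V" (made-up confidence value)
store.append(Assertion("mother_of", "Z", "Y", ("S", "T", "V"), 0.3))

mothers_of_y = store.candidates("mother_of", "Y")
likely = store.project(min_confidence=0.5)
```

Because both candidates remain in the log, a later change of rules can re-rank them without losing the evidence on which either was originally based.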
We hope to evaluate the techniques developed using the full set of birth, death and marriage records from Scotland from 1850 to the present. A preliminary phase of the project is now under way. [With Alan Dearle]
Towards Pervasive Personal Data
This project will investigate techniques for enabling pervasive file data across all the storage resources available to an individual or an organization. These may encompass personal computers and mobile devices, machines available within a work environment, and commercial online services. The envisioned infrastructure observes file changes occurring in any particular storage location, and automatically propagates those changes to all other appropriate locations, in a similar way to existing services such as Windows Live Mesh, Dropbox and SugarSync, but:
- does not rely on external services
- observes and exploits variations in machine and network capabilities
- exploits disparate storage facilities, including those provided by machines not exclusively under the user’s control
- exploits disparate data transfer mechanisms, including physical movement of passive devices
- supports high-level user policies governing data placement and synchronization strategy
The main challenges to implementing such infrastructure are in detecting and exploiting patterns in resource availability; planning, executing and monitoring good routes and schedules for data propagation; and supporting users in visualizing current system state and its relation to the goals of their high-level policies.
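One possible way to frame the route-planning challenge, offered only as an illustrative assumption, is as a shortest-path search over estimated transfer times between storage locations. The sketch below uses Dijkstra's algorithm; the location names, the `fastest_route` function and all of the timings are hypothetical. Physical movement of a passive device appears simply as another edge with its own estimated duration.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical estimated time, in seconds, to move one file between locations.
Graph = Dict[str, Dict[str, float]]


def fastest_route(graph: Graph, source: str, target: str) -> Tuple[float, List[str]]:
    """Dijkstra's algorithm over estimated transfer times."""
    queue: List[Tuple[float, str, List[str]]] = [(0.0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, seconds in graph.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + seconds, neighbour, path + [neighbour]))
    return float("inf"), []  # no route is currently available


links: Graph = {
    "laptop": {"home_server": 40.0, "usb_stick": 5.0, "work_pc": 900.0},
    "usb_stick": {"work_pc": 600.0},    # device carried by hand to the office
    "home_server": {"work_pc": 300.0},
}

cost, path = fastest_route(links, "laptop", "work_pc")
```

In a real system the edge weights would vary with observed resource availability, so a plan like this would need to be recomputed as conditions change.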
Resource-Aware Distributed Databases
The aim of this project is to further develop Angus Macdonald’s PhD work on workstation-based distributed databases. This investigated the issues in providing ACID semantics and automatic backup on a constantly shifting pool of non-dedicated machines within an organisation. Areas with potential for further research include autonomic optimisation of the placement of data and computation, the relaxation of strict consistency rules, and application to other database models. [With Alan Dearle]
Testing Distributed Software
The PlanetLab development platform allows the developer of distributed software to test it under real wide-area network conditions. However, there are significant administrative overheads involved in participation. This project would investigate the issues involved in developing a testbed for distributed applications, running on a number of local clusters, and allowing the experimenter to simulate wide-area latencies, bandwidths and machine failures. Such a testbed would need to automate the execution of repeatable experiments, and the recording and analysis of measurements. Aspects to be investigated include the system properties that can be measured, the accuracy to which real-world behaviour can be predicted, and the ease of use of the testbed.
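As a rough illustration of what simulating wide-area conditions in a repeatable way might involve, the following sketch models a single network link with configurable latency, bandwidth and failure rate. The `SimulatedLink` class and its parameters are hypothetical, and the model is deliberately simplistic; the seeded random number generator is what makes an experiment repeatable.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimulatedLink:
    """A hypothetical model of one wide-area link between two cluster nodes."""
    latency_s: float      # one-way latency in seconds
    bandwidth_bps: float  # bandwidth in bytes per second
    failure_rate: float   # probability that a given transfer fails
    seed: int = 0         # fixed seed so that runs can be repeated exactly

    def __post_init__(self) -> None:
        self._rng = random.Random(self.seed)

    def transfer_time(self, size_bytes: int) -> Optional[float]:
        """Simulated duration of a transfer, or None if the transfer fails."""
        if self._rng.random() < self.failure_rate:
            return None
        return self.latency_s + size_bytes / self.bandwidth_bps


# Two links built with the same seed produce the same sequence of outcomes.
run_a = SimulatedLink(0.15, 1_000_000, 0.5, seed=42)
run_b = SimulatedLink(0.15, 1_000_000, 0.5, seed=42)
outcomes_a = [run_a.transfer_time(500_000) for _ in range(20)]
outcomes_b = [run_b.transfer_time(500_000) for _ in range(20)]
```

A testbed would apply such a model to real traffic between local machines instead of merely computing durations, but the same parameters and seeding would apply.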
Previous PhD Students
- Simone Conte (2018): investigated user models for managing distributed data, reported in his thesis: The Sea of Stuff: a Model to Manage Shared Mutable Data in a Distributed Environment.
- Masih Hajiarab Derkani (2014): did his PhD work on Adaptive Dissemination of Network State Knowledge in Structured Peer-to-Peer Networks. He has published his Trombone software, an adaptive P2P overlay, and Shabdiz, a very lightweight Java tool that monitors a set of machines and ensures that a given application remains running on them.
- Markus Tauber (2010): applied autonomic management to distributed storage systems. He looked at autonomic control of maintenance scheduling in Chord, and of replica retrieval concurrency in a simple distributed block storage system. The work is reported in papers at DANMS 2011 and Self-Adaptive Networking 2010 and in his thesis: Autonomic Management in a Distributed Storage System.
- Aled Sage (2003): addressed the problem of how to configure a software system with a large number of tuning parameters. He developed a tool to automatically run performance tests using various parameter values, so that the best combination of parameter values could be selected. Exhaustive search is impractical because of the combinatorial explosion in the number of possible combinations; the problem is exacerbated by the fact that in a non-trivial system each test may take a relatively long time to conduct. In the case study of an industrial mail server, each test took 30 minutes. He used Taguchi’s Design of Experiments approach to select a very small subset of combinations from which reasonable conclusions could still be drawn. The work is reported in a paper at CDSA 2001 and in his thesis: Observation-Driven Configuration of Complex Software Systems.
- Khawar Shehzad (2019). Thesis: Defence Against DoS Attacks using ILNP and DNS
- Oleksandr Murashko (2018). Thesis: Using Machine Learning to Select and Optimise Multiple Objectives in Media Compression
- Graeme Stevenson (2015). Thesis: An Approach to Situation Recognition Based on Learned Semantic Models
- James Smith (2013). Thesis: Investigating Performance and Energy Efficiency on a Private Cloud
- Ali Khajeh-Hosseini (2012). Thesis: Supporting System Deployment Decisions in Public Clouds
- Angus Macdonald (2012). Thesis: The Architecture of an Autonomic, Resource-Aware, Workstation-Based Distributed Database System
- Rob MacInnis (2010). Thesis: A Scalable Architecture for the Demand-Driven Deployment of Location-Neutral Software Services
- Scott Walker (2006). Thesis: A Flexible Policy-Aware Middleware System
- Evangelos Zirintsis (2001). Thesis: Towards Simplification of the Software Development Process: The Hyper-Code Abstraction