A prize of £3,000 is offered for the best collection of public data representing the academic profile of members of the University. Data will likely cover directory information like name and email, together with subjects taught or studied, publications, research interests, etc.
We are organising the competition to investigate what clues might be available to a person seeking to decode anonymised information that the University might from time to time release. We assume that the curious can assemble a reasonably complete profile of University members and we would like to know what information an attacker might have available from public sources. The reason for organising a competition is that we anticipate that you will be more innovative than us in thinking of ways of acquiring and assembling the data, and that there will be a wide range of possible approaches.
The project will run for 6 weeks from Monday August 1st, 2011 through to midnight on Monday September 12th, 2011. Submissions should be in the form of a web site with profile data pages for individuals, together with a ‘data manual’ addressing the judging criteria below and a zip file of the code necessary to collect and present the data in the web pages. Questions should be addressed to firstname.lastname@example.org
Judging will necessarily be subjective, and will be carried out by a panel that includes respected members of the University and at least one external judge. Nevertheless, the criteria will include:
– Volume of data discovered. Duplicated data will be discounted and effort should be made to eliminate duplicate information from entries.
– Academic relevance of the information. We are not interested in where a person dined on a given Sunday, but in data that can reasonably be said to be part of their academic profile.
– Coverage of the University. The proportion of University members covered by the profile data assembled.
– Verifiability of the data. Where/how was the data sourced so that we can respond to challenges concerning validity?
– Accuracy of the data. If the data about an individual contains inaccuracies, it may reduce the utility of the rest of the information about that individual.
– Maintainability of the data. How easy will it be to re-run the data collection to keep it up to date, should that be desirable?
– Authority of data. Whether established by independent verification or by citation of an authoritative source (e.g. a department web site), any analysis of the likely authority of the information will enhance the value of the data.
– Willingness to share approaches and techniques. There will be a strong preference for projects that either make their techniques and code available through an open licence, or make the techniques and code available to the University. Publishing of the techniques and code is also encouraged.
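As an illustration only, the duplicate, coverage and verifiability criteria above could be approached along the following lines. This is a minimal sketch and not a required format: the `Fact` record, its attribute names, the identifiers and the example.org URLs are all hypothetical, and entrants are free to structure their data in any way they see fit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One item of profile data (hypothetical record layout)."""
    person_id: str   # hypothetical identifier for a University member
    attribute: str   # e.g. "research_interest", "subject_taught"
    value: str
    source_url: str  # where the item was found, kept for verifiability


def deduplicate(facts):
    """Keep the first fact for each (person, attribute, normalised value)."""
    seen = set()
    unique = []
    for fact in facts:
        # Normalise case and whitespace so trivial variants count as duplicates.
        normalised = " ".join(fact.value.lower().split())
        key = (fact.person_id, fact.attribute, normalised)
        if key not in seen:
            seen.add(key)
            unique.append(fact)
    return unique


def coverage(facts, all_member_ids):
    """Proportion of the given members with at least one fact recorded."""
    members = set(all_member_ids)
    if not members:
        return 0.0
    covered = {fact.person_id for fact in facts} & members
    return len(covered) / len(members)


# Made-up example data; the identifiers and URLs are placeholders.
facts = [
    Fact("abc12", "research_interest", "Computer security", "https://example.org/a"),
    Fact("abc12", "research_interest", "computer  security", "https://example.org/b"),
    Fact("xyz34", "subject_taught", "Algorithms", "https://example.org/c"),
]
unique = deduplicate(facts)
print(len(unique))                                               # 2
print(coverage(unique, ["abc12", "xyz34", "def56", "ghi78"]))    # 0.5
```

Keeping the source alongside every item means a challenge to its validity can be traced back to where it was found, and re-running the collection only requires regenerating the list of facts.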
We are not interested in information at any cost; entries are expected to demonstrate ethical behaviour.
Although the sources should all be public, it is not our intention to publish the aggregated data ourselves. Rather, each person’s data will be made available to the person concerned through a Raven-protected page and comments will be collected. Such comments will not affect the judging.
This challenge is part of a JISC project to investigate the release of system log data in anonymised form. In analysing the likelihood of identifying an individual from the anonymised data, it is thought valuable to understand what background information about a given individual it might be possible to assemble from public sources. This challenge will help us understand the scope and nature of such publicly available data.