Facilitating synthetic biology literature mining and searching for the plant community

This project set up and ran a hackathon at TGAC (now the Earlham Institute) to extend the ContentMine literature mining system so that plant-focused synthetic biology papers can be mined for facts and subsequently searched through the Grassroots Genomics web portal.

The Idea

We propose to fund an open hackathon for technologists and biologists to come together and produce concrete digital outputs that facilitate the indexing and searching of synthetic biology texts. Utilising the open technologies of the Grassroots Genomics (http://www.tgac.ac.uk/grassroots-genomics) project at TGAC and the ContentMine (http://contentmine.org/) platform from the University of Cambridge, the hackathon will mutually improve both platforms to enable better access to research literature in plant synthetic biology.

Both ContentMine and the Grassroots Genomics portal use a search infrastructure based on Lucene (https://lucene.apache.org/), a text-based search engine, to store data taken from academic papers and supplementary data. Both currently use the default implementation. However, all of the indexing and scoring functionality that Lucene uses for its searches is open to customisation, so it can be tailored specifically to plant-focused and synthetic biology terms and journals. The section of an academic paper in which a term appears, e.g. the abstract or the materials and methods, can be taken into account when weighting that term, and thus when prioritising the results returned to the user. Furthermore, custom parsers can be written for common types of supplemental data, for example experimental results and log files, to extract metadata that would otherwise be unavailable to search. We will first extend ContentMine’s ability to scrape and search relevant and important synthetic biology literature resources, and subsequently extend the Grassroots Genomics infrastructure to take advantage of this functionality.
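The section-weighting idea can be illustrated with a minimal sketch in plain Python. This is not Lucene's API or either platform's implementation: the section names, the boost values and the helper functions tokenize, score and rank are all assumptions made for illustration. It only shows how a match in the abstract could count for more than the same match elsewhere in a paper.

```python
import re
from collections import Counter

# Hypothetical per-section boosts. The values are illustrative only and are
# not taken from ContentMine, Grassroots Genomics or Lucene defaults.
SECTION_BOOSTS = {"abstract": 3.0, "methods": 1.5, "body": 1.0}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(paper, query_terms, boosts=SECTION_BOOSTS):
    """Sum query-term counts per section, scaled by that section's boost.

    `paper` maps a section name to its text. Sections without a
    configured boost fall back to a weight of 1.0.
    """
    total = 0.0
    for section, text in paper.items():
        counts = Counter(tokenize(text))
        boost = boosts.get(section, 1.0)
        total += boost * sum(counts[term] for term in query_terms)
    return total


def rank(papers, query):
    """Return paper identifiers ordered from highest to lowest score."""
    terms = tokenize(query)
    return sorted(papers, key=lambda name: score(papers[name], terms), reverse=True)


if __name__ == "__main__":
    papers = {
        "paper_a": {"abstract": "A synthetic promoter library", "body": "We describe the design."},
        "paper_b": {"body": "The promoter was cloned. The promoter was tested."},
    }
    print(rank(papers, "promoter"))
```

With these illustrative boosts, a single mention in the abstract outranks two mentions in the body. A real deployment would express the same preference through the search engine's own per-field weighting rather than a hand-written scorer.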

The Team

Dr Robert Davey,
Research Group Leader, Earlham Institute, Norwich

Dr Ksenia Krasileva,
Research Group Leader, Earlham Institute, Norwich

Dr Nicola Patron,
Research Group Leader, Earlham Institute, Norwich

Mr Richard Smith-Unna,
Graduate student, Department of Plant Sciences, University of Cambridge

Dr Peter Murray-Rust,
Reader Emeritus, Department of Chemistry, University of Cambridge

Project Outputs

Project Report

Summary of the project's achievements and future plans

Project Proposal

Original proposal and application


The hackathon fed into the further development of ContentMine. More information is available on the ContentMine website, blog and GitHub repository.

Facilitating synthetic biology literature mining: March 2016 Hackathon at TGAC


We used the OpenPlant funds to invite technologists and users from diverse scientific backgrounds to TGAC, and to host them at no cost, so that they could build on and use ContentMine tools to liberate synthetic biology scientific literature. The workshop centered on novel methods for discovering information about plants from the existing literature ("Content Mining"). Peter Murray-Rust and three colleagues from Cambridge prepared ContentMine software specifically for the workshop on the basis that "anyone can run it and get useful results". The hackathon ran over two days. The first day comprised talks and demos from the ContentMine team, followed by an afternoon in which attendees split into groups based on the skills and interests gleaned from the morning and installed the software on whichever operating system they normally used. On the second day, the groups continued developing and improving ContentMine functionality, fixing software bugs, and using the tools to gather papers and facts based on synthetic biology terms.

Attendees (Name / Affiliation / Twitter / Github)

  • Richard Smith-Unna / Cambridge + ContentMine + Mozilla / @blahah404 / github.com/blahah
  • Robert Davey / TGAC + Software Sustainability Institute (SSI) Fellow / @froggleston / github.com/froggleston
  • Dan MacLean / TSL / @danmaclean / github.com/danmaclean
  • Neil Pearson / TGAC / ... / github.com/NeilPearson
  • Christopher Kittel / Institute for System Sciences (Graz, AT) / @chris_kittel / github.com/chreman
  • Ben Ward / TGAC + The University of East Anglia / @Ward9250 / github.com/Ward9250
  • Colette Matthewman / John Innes Centre + OpenPlant / @_OpenPlant
  • Annemarie Eckes / TGAC
  • Felix Shaw / TGAC / @shaw2thefloor / github.com/shaw2thefloor
  • Lawrence Percival-Alwyn / TGAC / @LP_Alwyn / github.com/percival-alwyn
  • Toni Etuk / TGAC / @tonietuk / github.com/tonietuk
  • Xingdong Bian / TGAC / @bianxingdong / github.com/xbian
  • Anastasia Orme / John Innes Centre
  • Michael Macey / UEA
  • JD Santillana-Ortiz / University of Düsseldorf, Germany / @yjdso
  • Tom Arrow / University of Cambridge

Group 1: "Users" - led by Tom Arrow

This group spent day 1 getting the ContentMine software and its dependencies working on their laptops, and preparing to create ContentMine ‘dictionaries’, i.e. sets of terms related to their interests, often constructed by browsing the internet and skim-reading exemplar papers for commonly occurring terms. Day 2 was spent on specific mining examples, e.g. generating a knowledge network from an ARF7 (auxin response factor) query, with an initial dictionary of a small number of relevant terms (“root”, “development”, “network”) to help refine the collection of results. These results were then transformed into data tables in Excel, which were used to visualise the knowledge network with Cytoscape.
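The dictionary-to-network workflow described above can be sketched as follows. This is an illustration only and does not use the ContentMine tools or their dictionary format: the functions find_terms, cooccurrence_edges and edges_to_csv are hypothetical, the dictionary is a plain list of single-word terms, and two terms are linked whenever they occur in the same paper text.

```python
import csv
import io
import re
from collections import Counter
from itertools import combinations


def find_terms(text, dictionary):
    """Return the dictionary terms that occur as whole words in the text.

    Matching is case-insensitive and limited to single-word terms.
    """
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(term for term in dictionary if term.lower() in words)


def cooccurrence_edges(papers, dictionary):
    """Count, for each pair of terms, how many papers mention both."""
    edges = Counter()
    for text in papers:
        for source, target in combinations(find_terms(text, dictionary), 2):
            edges[(source, target)] += 1
    return edges


def edges_to_csv(edges):
    """Serialise the edge counts as a source,target,weight table."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "target", "weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()


if __name__ == "__main__":
    # Example texts are invented for illustration; they are not mined results.
    dictionary = ["ARF7", "root", "development", "network"]
    papers = [
        "ARF7 regulates lateral root development.",
        "A regulatory network for root growth.",
        "ARF7 activity in the root.",
    ]
    print(edges_to_csv(cooccurrence_edges(papers, dictionary)))
```

The resulting table has one row per pair of terms, which is the kind of edge list that can be opened in a spreadsheet or imported into a network visualisation tool such as Cytoscape.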

Group 2: "Javascript internals" - led by Richard Smith-Unna

This group developed interfaces to getpapers and other ContentMine tools so that they can function not only as standalone tools but also as libraries imported into other software. One example is automatically generating presentation slides from scientific papers scraped by getpapers and showing them with a new tool called slidewinder.

Group 3: "Text mining core" - led by Peter Murray-Rust / Rob Davey

This group spent the first day getting to grips with the AMI plugin architecture, which processes the data exported by getpapers and generates standardised outputs for further processing with other ContentMine tools such as norma. The second day was spent improving the CMine codebase that underpins AMI and norma, in order to produce a more robust and more easily extensible software library.

Group 4: "Data analysis and visualisation" - led by Chris Kittel

This group aimed to build a rich search framework and “Smart Search” user interface that passively learns a user’s research interests from the choices they make whilst browsing ContentMine results. As the user clicks on search results in the ContentMine data tables, feature metadata will be extracted from each clicked item and used to train a deep neural net or other form of multivariate classifier. Over time, this classifier should learn which features a user is interested in. These may even be quite obscure combinations of factors, such as mixtures of dates, authors and keywords, which the user might not even be aware of. The trained classifier can then be used to improve the ranking of items that the user is likely to find useful and interesting. This could be taken further, by implementing both global and personal coefficients in order to get "commonly searched" terms based on users’ interests but also those of the community as a