Making a Case for a Fully Open Trait Database

In the first week of September 2013, members of the TRY trait database consortium are meeting to discuss the future of TRY, including whether TRY should revise its model of data sharing.

TRY is a community initiative aiming to

  1. Provide a global archive of plant traits, and
  2. Promote trait-based approaches in ecology and biodiversity science.

Currently, TRY includes over 3 million trait records for about 69000 plant species, with contributions from 372 participants from 179 scientific institutes worldwide. TRY data are made available via a proposals process, with over 200 projects already served. This collaborative effort is a fantastic achievement that can rightly be celebrated.

At the same time, many people both within and outside the TRY group believe the model of data sharing currently adopted by TRY is unnecessarily restrictive and that our ability to solve key questions in ecology, evolution and conservation would be better served if TRY were to move to a fully open access model where data were available for immediate download and reuse. Thus, the purpose of this post is to outline the potential benefits of moving to a fully open access model of data sharing in plant trait ecology.

Before proceeding, I would like to recognise all the positive energy, good will and effort that has gone into creating TRY. At the same time, I would like to help build a vision of what TRY might become. I understand that a shift to an open data model might pose significant infrastructural and political challenges. Yet I am certain any challenges that arise could be overcome, if only the community were to embrace the vision of an open data model. Thus, while recognising the great things TRY has achieved, I encourage the members of TRY to take the next step by moving towards an open model of data usage.

Even if the organisers of TRY decide not to move towards an open access model, I hope this post may help inspire others to take up that general challenge.

Some key references

Many people have already written about the benefits of open access models. Here I simply aim to summarise some of the arguments already presented and point to relevant literature.

  • White, E.P. et al. (2013) Nine simple ways to make it easier to (re)use your data. PeerJ PrePrints, 1, e7v2. DOI: 10.7287/peerj.preprints.7v2
  • Poisot, T., Mounce, R. & Gravel, D. (2013) Moving toward a sustainable ecological science: don’t let data go to waste! DOI: 10.6084/m9.figshare.693745
  • Costello, M.J., Michener, W.K., Gahegan, M., Zhang, Z.-Q. & Bourne, P.E. (2013) Biodiversity data should be published, cited, and peer reviewed. Trends in Ecology & Evolution, 28, 454–461. DOI: 10.1016/j.tree.2013.05.002
  • Duke, C.H. & Poorter, J.H. (2013) The Ethics of Data Sharing and Reuse in Biology. BioScience, 63, 483–489. DOI: 10.1525/bio.2013.63.6.10
  • Lathrop, R.H., Rost, B., ISCB Membership, ISCB Executive Committee, ISCB Board of Directors & ISCB Public Affairs Committee. (2011) ISCB Public Policy Statement on Open Access to Scientific and Technical Research Literature. PLoS Comput Biol, 7, e1002014. DOI: 10.1371/journal.pcbi.1002014
  • Piwowar, H.A., Day, R.S. & Fridsma, D.B. (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE, 2, e308. DOI: 10.1371/journal.pone.0000308
  • Piwowar, H.A., Vision, T.J. & Whitlock, M.C. (2011) Data archiving is a good investment. Nature, 473, 285–285. DOI: 10.1038/473285a
  • Piwowar, H.A., & Vision, T.J. (2013) Data reuse and the open data citation advantage. PeerJ PrePrints 1:e1v1
  • Open Knowledge Foundation (2012) Open Data Handbook. http://opendatahandbook.org/pdf/OpenDataHandbook.pdf
  • Van Noorden, R. (2013) Data-sharing: Everything on display. Nature, 500, 243–245. DOI: 10.1038/nj7461-243a
  • Kueffer, C., Niinemets, U., Drenovsky, R.E., Kattge, J., Milberg, P., Poorter, H., Reich, P.B., Werner, C., Westoby, M. & Wright, I.J. (2011) Fame, glory and neglect in meta-analyses. Trends in Ecology and Evolution, 26, 493–494. DOI: 10.1016/j.tree.2011.07.007
  • Whitlock, M.C., McPeek, M.A., Rausher, M.D., Rieseberg, L. & Moore, A.J. (2010) Data Archiving. The American Naturalist, 175, 145–146. DOI: 10.1086/650340

Key reasons why I think we should support a fully open model of data sharing in ecology

Accelerating science

We are all united by a joint goal to discover key facts about the natural world and address the challenges posed by environmental change and human impact. As Michael Nielsen argues in his TED talk, “Open science now!”, we can accelerate scientific discovery through open exchange of data, code, and ideas.

Much of the scientific community is embracing this vision, with an avalanche of extraordinary resources now available under open data access models. Examples include genetic sequence data, climate data, soils data, species distribution data, phylogenetic trees, forest inventory data, taxonomic information and remote sensing information. In addition, ecologists benefit from a wide range of open tools, especially the statistical environment R and its packages. All of these resources are provided for free, with citation in publications the only request. These resources allow trait ecologists to achieve far more than might otherwise be possible in their research. Further opening trait databases would have the same effect.

Funding agencies such as NERC, the ARC and NSF have also recognised the value of making data open, and increasingly require data collected with public money to be made publicly available. Organisations such as TERN and the Data Observation Network for Earth (DataONE) have been created specifically to help provide access to open data. For example, the mission of DataONE is to “Enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it.”

By making trait data free to download directly, TRY could greatly increase usage of its data and thus help accelerate scientific discovery.

For more info see: Piwowar et al 2011, Poisot et al 2013, The Open Data Handbook.

Transparency and reproducibility

At the same time, there is a move towards making journal publications more reproducible and transparent. From the perspective of reproducible science, the ideal paper would include, as supplementary material, all the data and code used in the analyses, so that reviewers and readers can reproduce the results of a paper, evaluate the evidence for themselves, and build on the results. Datasets should be available in an easily downloadable format (e.g. as a CSV file) and released under a well-established license, such as the Creative Commons CC0 license used by Dryad. For publications reusing existing data, it might be more helpful to include a script for downloading and compiling the data from existing open sources, rather than re-releasing data available elsewhere.
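To give a sense of what such a download-and-compile script might look like, here is a minimal sketch in Python. It assumes each open source offers a CSV file with a header row; the function names, the citation keys and the `source` column are hypothetical choices for illustration, not part of any existing archive or of TRY.

```python
import csv
import io
import urllib.request


def fetch_csv_text(url):
    """Download a CSV file from an open archive and return its contents as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def compile_records(sources):
    """Merge several CSV texts into one list of rows.

    `sources` maps a citation key (hypothetical, chosen by the author of the
    paper) to the CSV text obtained from that source. Each row is tagged with
    its key so that the original data provider can be cited.
    """
    rows = []
    for citation_key, text in sources.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["source"] = citation_key  # keep provenance with every record
            rows.append(row)
    return rows
```

In a paper's supplementary material, one might call `fetch_csv_text` once per archived dataset and pass the results to `compile_records`, so that readers rebuild the compiled dataset from the original sources rather than from a re-released copy.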

Many journals are starting to embrace this vision. For example

By making trait data free to download directly, TRY would enable trait ecologists to lift the quality of their science and meet the new demands of reproducibility required by top journals in our field.

For more information see: Poisot et al 2013, Lathrop et al 2011, Costello et al 2013.

The moral arguments

Public money = public data

As research scientists, most of us are funded by public money. Thus our data are arguably also public. It is undeniably in the public interest that scientists make their data public once they have published the research for which they were initially funded.

By making trait data free to download directly, TRY can help honour the social contract scientists have with the public to generate knowledge and solve the problems facing society.

For more information see: Duke & Poorter 2013, Piwowar et al 2011, Poisot et al 2013

Built on the generosity of others

Some of the trait data in TRY have been compiled from the literature, using data that were made freely available by the original collectors. To honour the good will of the original data collectors, such compilations should also be made public with the relevant publications. An analogous situation exists in the world of open source software. The GNU General Public License allows free use of a very wide range of computer source code, provided derived products are made similarly available.

Common concerns about open access models

Data collectors deserve credit

I agree! It is just a question of how that credit is given. Under an open model, credit is given via citation but not through co-authorship on papers.

No one is suggesting that data collectors share their data before they have written the primary publications for which the data were collected. However, for reasons outlined in the previous section, I believe such data should then become publicly available and free to use by other investigators.

Any subsequent usage of data should of course be recognised through proper citation of the original data providers. This is also how the other open resources that trait ecologists benefit from (see section above) are recognised. By increasing one’s H-index, citations provide a good measure of the impact of one’s work. There is also evidence to show that papers with open data have a citation advantage (Piwowar et al 2007, Piwowar & Vision 2013).

There is considerable debate regarding the criteria warranting co-authorship on scientific papers (e.g. Poisot et al 2013, Duke & Poorter 2013). Existing guidelines such as the Vancouver protocol state that provision of data alone is insufficient, yet this has not been established as a norm within the field of ecology.

In the case of TRY, we are mainly talking about the reuse of data which have already been published elsewhere. Many people believe co-authorship on derived works is still warranted. While recognising the good intentions of those looking to honour primary data collectors, it is worth noting that such efforts work against those of us who make our data open, by increasing the publication lists of our peers.

There are also challenges in properly citing open-access data, which are often listed in supplementary material and thus not recognised via citation engines (Kueffer et al 2011). However, this is a general problem not restricted to the open access model. Moreover, those articles which are cited are often the data compilations, rather than the original data sources.

To overcome these issues, I suggest any move towards an open data model also support efforts to encourage proper citation of papers providing data, e.g. by indexing supplementary material.

Data will stop flowing under an open data model

I find this hard to believe, given the new requirements of journals and funding agencies for data archiving. I also believe that most people are happy to contribute to a greater good. That is why many of us make our data immediately available at the time of publication. I am confident that most people would happily contribute to an open database, where it was evident that such contributions would be honoured by equivalent levels of openness elsewhere.

Openness is bad for my career

Some people have expressed concern at ‘giving away data’, but available evidence suggests advantages to being open. For example, in a recent commentary in Nature (Van Noorden 2013), Amy Zanne said that making the global wood density database public has had very positive effects on her career. That dataset has now been downloaded over 5700 times.

Articles with public data also have a citation advantage (Piwowar et al 2007, Piwowar & Vision 2013). Costello et al 2013 suggest data contributors have a lot to gain by making citable products with their data.

Data are hard to collect, I don’t want to give them away

Collecting data is hard, but so is everything else in science, such as writing software. I want any dataset I collect to have a continuing impact on the field. I also want my datasets to be safe from technological change and accidental loss, and such loss is less likely when datasets are deposited in official archives.

We need to protect the vulnerable PhD students

As a basic principle, we could probably all agree that PhD students collecting substantial amounts of new data should not be asked to share these data until they have published their main results. Supporting this, many journals and funding agencies requiring data archiving at the time of publication also give the option of an embargo period before data become public.

Under any open model, TRY could leave it up to individuals to contribute data when they feel ready.

Best practices for sharing data

There are a number of things that will make your data easier to reuse, such as:

  • Depositing it on a stable server, e.g. Dryad, figshare or Ecological Archives, so that it remains available and is citable.
  • Releasing it under a standard and flexible license agreement.
  • Having good (ideally machine-readable) metadata.
  • Storing it in a non-proprietary, text-based format, e.g. CSV files.
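As a rough illustration of the last two points, the sketch below writes a dataset as a plain CSV file alongside a machine-readable JSON metadata file. The file names (`data.csv`, `metadata.json`), the function name and the metadata fields are assumptions made for this example, not an established standard; a real archive may expect a specific metadata schema.

```python
import csv
import json
from pathlib import Path


def write_dataset(directory, rows, metadata):
    """Write `rows` to data.csv and `metadata` to metadata.json in `directory`.

    `rows` is a non-empty list of dictionaries sharing the same keys;
    `metadata` is any JSON-serialisable dictionary (hypothetical fields such
    as license, creator and column descriptions).
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    data_path = directory / "data.csv"
    meta_path = directory / "metadata.json"

    # Plain text, comma-separated values: readable without proprietary software.
    with open(data_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    # Metadata stored next to the data so that machines and people can read it.
    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)

    return data_path, meta_path
```

A metadata dictionary might record, for instance, the license identifier, the units of each column and the publication to cite; the resulting pair of files can then be deposited together on a stable server.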

The following resources provide more information on best practices for data management, reuse and sharing:

Ways to encourage openness within the field

In his TED talk, Michael Nielsen asks “What are you doing to promote openness within your field?”

For my part, I have provided a number of open source resources. These include trait data (e.g. Falster et al 2003, Falster et al 2005, Falster et al 2005) and software packages like smatr, which is now used in hundreds of publications. I am also currently collating a large biomass and allometry dataset that will be made open access in the form of a data paper, where all data contributors are included as co-authors. Once published, anyone will be able to use these data in whatever way they like.

Here are some suggestions you might consider either within or outside the TRY framework:

  • Make your own data fully open. There are now many ways in which you can make data readily available and citable - see White et al. preprint and Poisot et al 2013 for a list of options.
  • Only work with open datasets.
  • Lobby for open data.
  • Lobby for proper recognition of data reuse via citation pathways, e.g. by including references in the main paper or via indexing of supplementary materials.
  • Recognise open data contributions during recruitment, grant and paper evaluation.
  • Enforce guidelines on data archiving where they exist.

I hope this post has helped convince you that an open trait database is both desirable and achievable. There are undoubtedly challenges to overcome, but first we need to embrace the vision.

Acknowledgements

Although not all of these people necessarily agree with sentiments expressed in this post, I would like to thank the following people for helpful discussions: Rich FitzJohn, Ethan White, Mark Westoby, Amy Zanne, Jens Kattge, Colin Prentice, Will Cornwell.

Do you agree?

If you agree with the sentiments expressed in this post, feel free to contact me to add your name to the list below. Conversely, if you disagree with the sentiments expressed in this post, feel free to post your thoughts elsewhere online and send me a link to your post. In the interest of providing healthy discussion, I will add a link to your post below.

You can contact me via Twitter, email or by adding an issue to this GitHub repo.

Responses

Jens Kattge and Gerhard Boenisch, the main organisers of TRY, have posted a response to my letter on the TRY website. I would like to thank Jens and Gerhard for reading my post and responding to the issues raised.

Edits

Some more resources I became aware of, after making this post:

  • Whitlock, M.C. (2011) Data archiving in ecology and evolution: best practices. Trends in Ecology & Evolution, 26, 61–65. DOI: 10.1016/j.tree.2010.11.006
  • The Panton Principles: Principles for Open Data in Science. States “Science is based on building on, reusing and openly criticising the published body of scientific knowledge. For science to effectively function, and for society to reap the full benefits from scientific endeavours, it is crucial that science data be made open.”
  • The Open Definition sets out principles that define “openness” in relation to data and content. States “A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.” See full definition here.