A vision for collaborative training infrastructure for bioinformatics

In biology, a missing link connecting data generation and data‐driven discovery is the training that prepares researchers to effectively manage and analyze data. National and international cyberinfrastructure, along with evolving private-sector resources, places biologists and students within reach of the tools needed for data‐intensive biology, but training is still required to make effective use of them. In this concept paper, we review opportunities and challenges that can inform the creation of a national bioinformatics training infrastructure capable of serving the large number of emerging and existing life scientists. While college curricula are slower to adapt, grassroots startup‐spirited organizations, such as Software and Data Carpentry, have made impressive inroads in training on the best practices of software use, development, and data analysis. Given the transformative potential of biology and medicine as full‐fledged data sciences, more support is needed to organize, amplify, and assess these efforts and their impacts.


Technology scales, but what about people?
As the term "genomical" has been suggested as an adjective to replace "astronomical" in describing biological datasets, 1 it is self-evident in 2016 that life science is data science. High-throughput nucleic acid sequencing has given rise to massive amounts of data, including repositories such as the NCBI Sequence Read Archive (SRA). The SRA's 4.5 quadrillion bases of sequence are certain to be an invaluable resource for discovery. Clearly, data production is no longer a bottleneck; genome sequencing costs have decreased 1000-fold in the last decade and are undergoing another steep decrease this year. 2 Advances in image acquisition and analysis promise to accelerate on par with high-throughput sequencing. 3,4 As computation continues to progress, it is clear that the technologies and costs underpinning biological data science will comfortably scale for some time. To quote Sartre, however, "l'existence précède l'essence" (existence precedes essence): simply because these data exist does not guarantee that they will generate clear and direct paths to insight. Francis Collins summarized major lessons from the Human Genome Project 10 years after its completion. 5 His reflections on the remarkable advances in technology are peppered with subtle acknowledgment that some predictions made at the presidential press conference announcing the project's completion (8 months before publication) may have been overly bold. A decade later, The New York Times headlines 6 have captured at least some of the public's disappointment that promised advances from the sequenced genome are taking longer than originally hoped.
The Human Genome Project was, and will continue to be, transformative and, for Collins, led to important lessons, including the necessity of open access to data, the need to keep technology a major focus of public-private development, and the importance of wise policy decisions. But the question arises as to whether there is another lesson to learn, a missing link that could help avoid another series of heightened expectations for a biology in which researchers can "sequence everything." While it is easy to plot trends of how sequencing and computation have progressed, there is a potentially more difficult and more valuable question to ask: how do we scale the people who use these technologies? Although this concept paper cannot go much further than outlining current computational resources and framing the question of how we might scale scientists to meet the challenges (and opportunities) of biological data science, we hope to spark discussion and suggest directions meriting further investigation. A vision, if not a hypothesis, for the missing link that transforms data into insight hinges upon building training infrastructure that scales people. Much of this vision stems from the authors' work in bioinformatics and in scaling training and access to cyberinfrastructure through Software and Data Carpentry and CyVerse.
It is difficult to speculate exactly how much progress could have been made if the Human Genome Project had been handled by a scientific community better prepared for working with the prototypical "genomical" dataset. Still, clues for answering this question will be as relevant to the success (or failure) of biological data science today as they would have been 10 years ago. How we build computational capacity and competence for life scientists at all levels is of everyday relevance. For example, the principal insight of NCBI's first hackathon on "Advanced Bioinformatic Analysis of Next-Gen Sequencing Data" was that the acquisition and accessibility of these data were "a major hurdle for participants, even though they were genomics professionals." 7 If even expert bioinformaticians have difficulties, how will the vast majority of biologists, who must use bioinformatics themselves or work with a bioinformatician, fare? Additionally, advanced cyberinfrastructure has been developed with both centralized and decentralized solutions, but many life scientists are unaware of how to access or utilize these resources. A comprehensive approach to scaling people will involve developing tools and strategies that support large numbers of existing scientists, addressing deficiencies in current training pipelines, and delivering agile approaches for up-to-date and sustainable instruction. We also anticipate that the most effective strategies will be informed by the centralized and decentralized (distributed) approaches that characterize the technological infrastructure already driving biological data science.

Centralizing and decentralizing: cyberinfrastructure
The development of cyberinfrastructure resources that are both centralized and decentralized serves as a model for how training can be established and scaled. It also provides motivation for this training: biologists have ready access to the computation needed for their work, but often lack the skills to utilize it effectively. The current model for developing and utilizing cyberinfrastructure is powerful because it simultaneously explores multiple technologies and approaches, both centralized and decentralized. We will revisit this observation in thinking about both how we use technology and how we train biologists to use technology, outlining available resources and the efforts that have made them accessible to researchers.
Biological sensors (e.g., sequencers and cameras) and computational resources have their own blend of centralized and decentralized components; sequencers have moved away from specialized centers, and computation has left the desktop and moved into the cloud. In parallel, cyberinfrastructure has been working in the opposite direction to bring data and compute closer together through public data repositories and resources that bring compute to data. We could condense this model by saying, "find computing where you can, keep instruments close, and your data closer." In private-sector computing, virtualization, specifically distributed cloud computing, has made powerful computation accessible and inexpensive. One analysis of a genome mapping workflow estimates the cost of mapping a human genome at less than U.S. $50, and under $400 for a scalable production environment on Amazon Web Services. 8 Archival data management platforms provide effectively unlimited redundant storage; Amazon's current price is around $0.007 per gigabyte per month. While the exact specifications of private resources such as Amazon Web Services are not known, they are certainly far more capacious than publicly funded initiatives such as CyVerse (National Science Foundation (NSF)-funded biological cyberinfrastructure) or even XSEDE (NSF-funded national supercomputing for science). 9 Although the full potential of these technologies for biology likely remains underutilized, they already prompt us to imagine a "computation-unlimited" paradigm for life science.
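To make these cost figures concrete, the following is a minimal back-of-envelope sketch in Python. The per-gigabyte archival price and per-genome mapping cost echo the estimates cited above; the raw data volume per genome, cohort size, and retention period are purely illustrative assumptions.

```python
# Back-of-envelope cloud cost sketch. The archival price and per-genome
# mapping cost echo the figures cited in the text; the data volume per
# genome, cohort size, and retention period are hypothetical assumptions.

ARCHIVE_PRICE_PER_GB_MONTH = 0.007   # USD per GB per month (archival storage)
MAPPING_COST_PER_GENOME = 50.0       # USD per genome (cited workflow estimate)
RAW_DATA_PER_GENOME_GB = 200         # assumed raw sequence data per genome

def estimate_cost(n_genomes: int, months_archived: int) -> float:
    """Rough storage-plus-mapping cost (USD) for a cohort of genomes."""
    storage = (n_genomes * RAW_DATA_PER_GENOME_GB
               * ARCHIVE_PRICE_PER_GB_MONTH * months_archived)
    mapping = n_genomes * MAPPING_COST_PER_GENOME
    return storage + mapping

# Example: 100 genomes, with raw data archived for 5 years
print(f"Estimated cost: ${estimate_cost(100, 60):,.0f}")
```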
As much as computation might become unlimited, a completely decentralized approach powered by the private sector contrasts with (but can be complemented by) trends in computing for biological data science. In particular, researchers need access to public data, and biological computing has needed to develop cyberinfrastructure suited to searching, sharing, and analyzing biological datasets. Given the need for researchers to work with common data types and formats, centralization (which is, in some ways, more complicated than decentralization) is very efficient. If the tremendously useful and successful Galaxy software had a limitation early on, it was that the initial implementation followed a decentralized model in which every user needed to stand up his/her own Galaxy instance. As a result, it was difficult for users to share datasets or avoid upload limitations, and it was possible for Galaxy administrators to mismanage reference data in ways that caused user frustration and/or invalid results. 10 Placing most, or at least many, of these data resources in one place has been a successful idea since GenBank and other biological repositories became web services in the mid-to-late 1990s. 11 It was not until recently, however, that connectivity and computational capacity could support the next evolutionary step: colocalization of datasets with sufficiently powerful resources for computation and data management.
In 2008, the NSF funded the iPlant Collaborative (now renamed CyVerse) as an answer to biologists' needs to access computational tools and share datasets, not just at their own institutions but across the entire community. 12 iPlant was not just a collection of online tools, but a true cyberinfrastructure with access to cloud and high-performance computing (HPC), as well as mechanisms to manage large biological datasets. Similar infrastructures are being developed, including the European Union's ELIXIR project, 13 Compute Canada, 14 and the National Institutes of Health Big Data to Knowledge and Genomic Data Commons, 15 all of which set out goals and careful plans to facilitate the pairing of data and compute.
Centralized public and decentralized private computing and data are not mutually exclusive; they are complementary (after a bit of hard work to combine them). Merging public and private resources is likely the best current solution for meeting the needs of biologists operating at scale, who need the flexibility to draw on all available resources. The Agave API, developed by CyVerse and the Texas Advanced Computing Center at the University of Texas at Austin, combines public and private computing resources in a science-as-a-service platform with the programmatic access demanded by power users and the ease of use preferred by those working with graphical user interfaces. 16 Both centralized and distributed technologies enable large, complex biological investigations to move out of the exclusive operating domain of international consortia and into the laboratories of individual investigators.
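As an illustration of the kind of programmatic, power-user access a science-as-a-service platform provides, the sketch below submits a hypothetical analysis job over HTTP. The endpoint URL, payload fields, and token handling are illustrative assumptions and are not taken from the Agave API documentation.

```python
# Hypothetical sketch of submitting a job to a science-as-a-service REST
# platform. The URL, payload schema, and token below are assumptions for
# illustration only, not the documented Agave API.
import requests

API_BASE = "https://example.org/science-api"   # hypothetical endpoint
TOKEN = "YOUR_ACCESS_TOKEN"                    # OAuth-style bearer token

job_request = {
    "name": "map-sample-001",                                  # hypothetical job name
    "app": "bwa-mem-0.7",                                      # hypothetical registered app
    "inputs": {"reads": "storage://project/sample_R1.fastq.gz"},
}

response = requests.post(
    f"{API_BASE}/jobs",
    json=job_request,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
response.raise_for_status()
print("Submitted job:", response.json())
```

The appeal of combining public and private resources behind one interface is that the same request could, in principle, be routed to a campus cluster or a commercial cloud without changing the researcher's workflow.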

Centralizing and decentralizing: training
Technology and people scale at wildly different rates. Clearly, computing will continue to make it easier for biologists to do Big Science, but training is needed to make sure that this science will be more powerful and more correct. Troublingly, Gewaltig and Cannon 17 reported that the "vast body of results being generated by current computational science practice suffers a large and growing credibility gap: it is impossible to verify most of the computational results shown in conferences and papers." The bottleneck is people: biologists who lack sufficient training in both the use and the best practices of computation. 18 The question arises as to whether enough is being done to support these biologists; we believe that the answer is probably not.
Bioinformatics is a relatively young field, but training efforts and formal curricula have been around for more than a decade. 19,20 Still, without a large, distributed body of faculty as comfortable with bioinformatics as with other (more biology-based) techniques such as molecular cloning, a large skills gap will continue to exist. Survey data clearly indicate that the majority of biologists do not possess the skills needed to marshal and manage the computing resources their investigations require. 21 Survey data from the European Molecular Biology Laboratory (EMBL) indicate that 60% of biologists report a need for more training (compared with 5% who say they need more computing power). 22 Similar CyVerse survey data show that greater than 95% of researchers are, or plan to be, working with large genomics datasets, yet 68% report that they have beginner-level bioinformatics skills or less (J.J. Williams, unpublished data). 23
Appropriately, examining training efforts as centralized and decentralized approaches may highlight how we might borrow and adapt from the successful technological model: "keep training local, but develop training materials globally." Some researchers are able to take advantage of centralized training experiences at their own institutions. A local group of faculty and students, often coalescing around a new faculty member who introduces genomic approaches, can be an excellent way to bring a department up to speed. This approach can also be unhelpful, however, if, for example, the training program is isolated from the wider community and propagates out-of-date information. While local training efforts make an impact, they often do not scale beyond a department (great training may be happening in microbiology but be completely unavailable in ecology). Additionally, centralized approaches are almost necessarily slower to react to new technologies, which, again, scale orders of magnitude faster than people or curricula.
Massive Open Online Courses (MOOCs) and other online teaching and training are an existing approach that can clearly scale the number of learners. However, because course development is typically not open, they still suffer from the same problems as any training that originates from a single source (one university or one small group of instructors). Additionally, MOOCs have very low (<10%) completion rates, 24 although these rates may be better for professional development learning.
Collaborative, decentralized training has clear advantages for democratizing access and responding at scale to rapidly evolving science. Software Carpentry and Data Carpentry have developed reputations for using these approaches to deliver training, and they present models that more closely resemble Silicon Valley software startups than university departments. Software Carpentry started in 1998, initially delivering 5-day workshops on software development for science. Its focus has been to make science more reproducible and efficient by giving scientists the skills to develop automated workflows, use version control, and test the software they produce. In 2014, Data Carpentry adopted a similar approach, but with a focus on data analysis, management, and domain-specific topics in biology (the reader is referred to Teal et al. 25 for a discussion of needs assessment for biologists and a deeper look into the thinking behind these approaches). In a sense, the Software and Data Carpentry models borrow the best aspects of formal and peer training and scale them out: their instructors are volunteers who have expertise and work within their respective scientific domains. Importantly, instructors share knowledge, experience, and pedagogical technique because they are unified by an open and collaborative lesson development infrastructure. Lessons are effectively developed through a crowdsourcing approach, moderated by community-selected lesson maintainers, with contributions organized through GitHub (again borrowing from the ideas of modern software development). The result is that lessons are openly reviewed, iteratively developed, and capture the consensus of community knowledge. So far, with a pool of just under 500 certified instructors operating internationally at workshops on a weekly basis for the last several years, Software Carpentry has reached more than 17,000 learners.
Many other decentralized bioinformatics training programs exist (decentralized meaning, in this case, that they are driven by external and internal instructors following a dynamic and collaboratively developed curriculum, rather than by an isolated group of practitioners), for example, Cold Spring Harbor courses, Marine Biological Laboratory courses, Strategies and Techniques for Analyzing Microbial Population Structures, Next-Generation Sequencing, and Explorations in Data Analysis for Metagenomic Advances in Microbial Ecology. It would be a worthy effort to align these decentralized efforts into scalable programs that share resources and increase accessibility. Even as they continue to scale, Software and Data Carpentry are only the beginning of a possible solution. There continues to be vast unmet demand for their services (typically, workshops fill up within hours of opening registration). Furthermore, while 2-day or even week-long workshop models can communicate the basic skills needed to get biologists started with effective bioinformatics, they are only a good beginning; accommodating all of the necessary skills or advanced training needs will require sustained, longer-term efforts.

Developing a training infrastructure for bioinformatics
With sustained advances in data acquisition and analysis, and with successful models for training in hand, what can be done next? We present a few suggestions driven by the authors' experience delivering bioinformatics training and professional development at scale.

Technological infrastructure must be complemented by training infrastructure
There is tremendous opportunity to create a setting in which the tools of bioinformatics are as standardized as other scientific instrumentation. As capacity and technologies scale, cyberinfrastructure and cloud resources make it possible not only to analyze Big Data but also, increasingly, to work from common platforms and interfaces. For example, technology such as Docker containers 26 can help deliver training environments that make lesson materials applicable across diverse contexts. It will be critical for cyberinfrastructure platforms to be closely tied to training and user needs. The need for biologist-friendly designs discussed by Kumar and Dudley 18 is very relevant, and one of the strategies CyVerse has employed to make its platforms more usable is to pair conventional documentation with Software and Data Carpentry training on best computational practice. Amplifying the EMBL survey data, it bears repeating that training must be a priority and not an afterthought. It is probably asking too much to suggest that a bright graduate student should be concerned with how to train thousands of potential users before she/he publishes a novel algorithm. Collectively, however, we can develop training infrastructure that makes it easy to onboard new technologies, using modular approaches, templates, and tools for training materials that can be delivered and powered by community effort.

We need to brainstorm methods to assess both the productivity and efficiency of biologists and our training efforts
This is perhaps one of the more difficult goals we identify. Even at the earliest level, it is probably true that how we first teach bioinformatics is poorly assessed. A meta-analysis of the assessment efforts of 226 papers on genomics and bioinformatics education reveals that less than 10% of these studies contained any evidence of the reliability or validity of the assessment approaches used. 27 There is a large body of work and expertise on evidence-based instruction, but how this information could be leveraged to develop the best training needs to be better understood. Without delving into the largely open question of how to measure productivity in science, better tools could be developed to determine whether scientists are adopting practices that should ensure greater productivity and reproducibility. If a graduate student is working with a complicated whole-genome association study in a Microsoft Excel spreadsheet rather than an appropriate R data frame, he/she is probably being hamstrung in significant ways. Explicit attention to this goal should identify and leverage current relevant efforts. A reasonable outcome could include improved guidelines for data management plans that require more explicit reporting of bioinformatics methods in a way that reflects minimum standards for reproducibility.
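To make the contrast concrete, here is a minimal sketch of the scripted data-frame habit described above. The text mentions an R data frame; the pandas DataFrame used here is the analogous structure in Python, and the file and column names are hypothetical.

```python
# Minimal sketch of a scripted, reproducible alternative to hand-editing a
# spreadsheet. The file name and column names are hypothetical examples.
import pandas as pd

# Load the study table into a data frame; every subsequent step is code,
# so the analysis can be rerun, reviewed, and version controlled.
phenotypes = pd.read_csv("gwas_phenotypes.csv")

# Explicit, repeatable operations replace manual spreadsheet edits.
cases = phenotypes[phenotypes["case_status"] == 1]
print(cases.groupby("cohort")["age"].describe())
```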

We should advocate for bioinformatics core competencies
There have been broad efforts to develop core competencies for bioinformaticians and bioinformatics training efforts, 28 and examples such as the newly funded Network for Integrating Bioinformatics into Life Sciences Education 29 indicate growing awareness of this need. With community support, a well-communicated effort to develop general recommendations and core competencies would establish clear touchstones for biologists who want to know whether they can trust their skills and interpretation of results (or identify where further expertise is advisable). Aligning training to core competencies also allows biologists to focus their learning on the topics most relevant to their research. On a policy level, this type of effort was employed to improve undergraduate biology education through the NSF and American Association for the Advancement of Science Vision and Change in Undergraduate Biology Education report. 30 A "living" set of community-developed bioinformatics lessons representing core competencies in several relevant subdomains (e.g., genome assembly, genome-wide association studies, and image acquisition) would be a powerful resource for delivering up-to-date training at scale.

We need to think about scaling the peer-training model
Besides collaborative lesson development, Software and Data Carpentry are examples of how volunteer trainers (practicing professionals in their own domains) can be important resources for improving how skills are disseminated across communities. These instructors are passionate about sharing their expertise and receive further Carpentry training on how to be effective teachers. Although the biology community has these human resources, "the scientific community has failed to craft attractive career paths for those who do the analyses it increasingly requires." 31 We need to find ways to harness community enthusiasm and ingenuity as part of the dissemination process. This may take a variety of forms, from hybridizing in-person workshops with virtual training to creating generic training infrastructure so that high-value training can be regularly distributed across institutions. All state universities, for example, could host bioinformatics training nodes that function as a "content delivery network" for bioinformatics. We also need to understand how to incentivize and create career paths for these human resources.

We need to encourage local "communities of practice"
While training is essential, it does not end when an instructor leaves the room. Training is an ongoing process that can be furthered and encouraged by engagement with a community working in and developing these practices. Programs such as Python and R user groups and meetups are successful in bringing together people with shared interests, welcoming new users, presenting information on new approaches, and responding to questions from the community. The local aspect of these groups means that researchers have nearby resources for help and encouragement. Additionally, such groups develop peer incentives for good computational practice. Encouraging the development of these groups by incentivizing their leaders and providing easy-to-use resources, such as "data analysis book clubs" or information on best practices for community organization, could further these efforts.

Concluding remarks
Not every problem in biology will require massively parallel computation and access to terabytes of data, but we need to work toward ensuring that, when these resources can help, they are being exploited. Improving the educational pipeline to create greater numbers of formally trained bioinformaticians will be important, but on its own it probably misses the importance of creating a community of biologists who are computationally informed. As Lincoln Stein notes, "Bioinformatics has become too central to biology to be left to specialist bioinformaticians." 32 It is the opinion of the authors that the approaches outlined here hold potential to scale biologists. The vision presented here, along with many more ideas from bioinformaticians and educators in the field, will require significant support to be crafted into a true training infrastructure, perhaps at a level of effort complementary to the resources invested in developing computing infrastructure. Biologists do not need to become computational experts, but they must have the shared vocabulary and skills needed to be conversant with experts. A national bioinformatics training infrastructure may be the best strategy to empower researchers to participate in biology's evolution as a data science. Approaches that leave people behind will mean that the coming yottabytes of biological data could amount to adding hay to the proverbial haystack; we risk burying insights.