Helping meet the diverse data needs of researchers

Phoenix provides a wide range of high-quality, financially sustainable research data services for researchers, libraries, universities and governments. We envision a future where a thriving and diverse ecosystem of digital research resources better meets the needs of researchers and society and is sustained by a range of funding models.

Are you ready for the Nelson memo? Data repositories and access tools will soon be a necessary component of most grant-funded research projects. Grant-funded data also risks disappearing when funding ends, taking valuable resources and expertise with it. We work with researchers and institutions to develop long-term data warehousing and funding models, ensuring that valuable scientific data and software resources will survive and prosper.

Managing the challenges of open data, curation, and usability

Phoenix is at the cutting edge of addressing the world’s emerging research data needs. These challenges are described in a 2023 report by the Open Scholarship Initiative (Phoenix is highlighted in the data curation description). Please visit our News & Resources page for additional information.

Competition, collaboration, and data sharing are three key drivers in research. Each of these drivers has unique practices, outcomes and challenges, but they are also closely linked and affect each other. Competition has always been fundamental to the very fabric of research, for example, but as research becomes increasingly complex, collaboration is also increasingly important, and with it, data sharing as well. Still, relatively few researchers (around 15%) currently share their data outside a limited group of colleagues in any comprehensive and meaningful way (notable exceptions include astronomy, high-energy physics and genomics) due to a variety of concerns and challenges. Similarly, the race to discover has always been a key part of science, but this race sometimes leads to a hyper-focus on secrecy, a temptation to commit fraud, the hiding of negative findings, and other behaviors that conflict with the needs of good science and open science.

Understanding how these three drivers operate and are evolving in the real world is important for understanding how to improve the research of tomorrow. For example, the needs and concerns of researchers with regard to data sharing generally fall into six main categories: impact, confusion, trust, effort, access and equity.

  1. IMPACT: Will my research have greater benefit if I share my data? What benefit will I get from this personally? Will my open data efforts be well received by colleagues and tenure committees?
  2. CONFUSION: Where should I begin? What kind of license should be used? What data should be shared, in what format, with whom, and in what repository?
  3. TRUST: Will my open data be misinterpreted or misused? Will my potential discoveries be scooped?
  4. EFFORT: Will complying with data requirements take up too much time? Different publishers and repositories all have different compliance formats and requirements. Will I be responsible for maintaining my data over the long term?
  5. ACCESS: Who needs access to my data anyway and for what reasons? Some datasets are so large that they can’t be uploaded via the Internet. For what purpose will my data be used? Would data summaries suffice instead?
  6. EQUITY: Overall, is this data sharing mandate even fair to me and my colleagues? For example, data processing capabilities vary widely by region, field and institution. Researchers from lower-resourced institutions often lack the large support networks and processing facilities that more privileged researchers might take for granted. So, why should these lower-resourced researchers share their hard-earned information and then not be able to extract any value from it?

There are also many challenges regarding the data itself. These include:

  • How can we fund and maintain the infrastructure necessary for data processing, curation, and preservation?
  • How do we protect against link rot, data decay and data obsolescence over time?
  • Big data keeps getting bigger. Can our sharing tools keep pace?
  • What happens to data once a research facility is shut down and data needs to be preserved and curated for decades more?
  • What happens to long-tail data, and the data that sits on laptops or personal websites with minimal or no attached metadata or documentation? Failing to capture this data contributes to issues like irreproducibility, duplicate research, and innovation loss.
  • Who pays long term for data care and maintenance?
  • How do we ensure the timely sharing of critical data (insofar as rapid sharing impinges on secrecy)?
  • How do we ensure better data quality, consistency and completeness?
  • How do we standardize data formats and collection processes (where necessary) to ensure data completeness and comparability?
  • How do we create internationally agreed-upon minimum standards for metadata (further complicated when metadata are not in English)?
  • How do we establish interoperability and searchability between data platforms (without which researchers need to search and make requests across multiple platforms)?
  • How do we create internationally agreed-upon standards for Data Availability Statements?
  • Can we streamline the governance structures used by different platforms?
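The metadata-standards challenge above can be made concrete: a minimum standard is, in essence, a machine-checkable list of required fields. The sketch below shows what such a check might look like. The field set is a hypothetical example loosely inspired by common descriptive-metadata elements, not any agreed international standard.

```python
# Minimal sketch of validating a dataset record against a hypothetical
# "minimum metadata" standard. The REQUIRED_FIELDS set is illustrative only.

REQUIRED_FIELDS = {"title", "creator", "date", "identifier", "license", "description"}

def validate_metadata(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    problems += [f"empty field: {k}" for k in sorted(record)
                 if k in REQUIRED_FIELDS and not str(record[k]).strip()]
    return problems

record = {
    "title": "Soil moisture time series, 2020-2023",   # hypothetical dataset
    "creator": "Example Lab",
    "date": "2024-01-15",
    "identifier": "doi:10.1234/example",               # hypothetical DOI
    "license": "CC-BY-4.0",
}
print(validate_metadata(record))  # → ['missing field: description']
```

Even a simple check like this only works if platforms agree on the field list and on how non-English metadata values are handled, which is exactly where international agreement is still missing.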

Other challenges include the fact that very little funding support is available to facilitate data sharing and to improve data infrastructure; that code sharing needs to be improved (for many kinds of research, sharing or reanalyzing data without the original code means preserving little more than a jumble of numbers); and that high-level data sharing policies often conflict (for example, the EU's GDPR conflicts with most global clinical trial data sharing policies, and this conflict has yet to be resolved). Open questions also remain about which metrics are best for evaluating open data, and how to reward open data practices.

To date, none of the major global open solutions policies, or even the discussions leading to these policies, have focused on the importance of curation in making research information useful. What is curation? Essentially, it means organizing information. This kind of organization is all around us: imagine grocery stores where food is not organized into aisles, amazon.com without consistent ways of cataloguing and displaying product information, ancestry.com without metadata that enables different family trees to connect together, fields of study without a sophisticated understanding of the knowledge that already exists and how it is organized, or search engines that don’t know how to crawl the web. Organizing information is a prerequisite to making it useful (at least for humans).

In an undertaking like research, organizing information has added dimensions like making sure units of measure are standardized across fields, making sure the data being collected across studies is consistent, filling in missing pieces of data, adding explanations, and otherwise properly cleaning, documenting, labeling, transposing, and formatting work for sharing. All this effort takes time and money (someone needs to do this, and not for free), and the time for doing this is limited because projects need to report data within a given window, grant funding eventually runs out, and principal investigators and researchers eventually move on to other projects.
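The curation steps described above — standardizing units, normalizing labels, and filling gaps — can be sketched on a toy dataset. The field names, unit table, and "unknown" placeholder below are hypothetical illustrations, not a real study's schema or an established curation workflow.

```python
# Sketch of routine curation: convert all masses to kilograms, normalize
# free-text site labels, and flag a missing value. All names are invented.

UNIT_TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.453592}

raw_rows = [
    {"sample": "A-1", "mass": 1200, "unit": "g",  "site": "north ridge"},
    {"sample": "A-2", "mass": 2.5,  "unit": "kg", "site": "North Ridge"},
    {"sample": "A-3", "mass": 3.0,  "unit": "lb", "site": None},  # site not recorded
]

def curate(rows):
    cleaned = []
    for row in rows:
        cleaned.append({
            "sample": row["sample"],
            # Standardize every mass to one unit so rows are comparable.
            "mass_kg": round(row["mass"] * UNIT_TO_KG[row["unit"]], 4),
            # Normalize labels; document gaps explicitly rather than dropping rows.
            "site": (row["site"] or "unknown").strip().lower(),
        })
    return cleaned

for row in curate(raw_rows):
    print(row)
```

Each of these steps is trivial here, but across thousands of rows, dozens of collaborators, and inconsistent collection practices, this is exactly the labor that takes time and money and must be finished before funding runs out.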

How much time and money is needed? No one knows for sure; data curation isn’t an activity that has been well investigated and documented, and the needs vary widely based on factors like data volume, complexity, privacy considerations, intellectual property constraints, and the number of collaborators. What is certain is that the most widely used and long-lived curated resources require massive ongoing investments of time, money and attention. Also, while “openness” is important to some of these efforts, it’s irrelevant to others; the common denominator isn’t openness but finding reliable and relevant information, converting it into something useful, and then making the curated resource accessible to the world as part of a cohesive narrative.

Here are a few examples of highly successful curated research resources:

  • WHO: Founded in 1948, the World Health Organization today manages and maintains about 75 different curated data collections related to global health and well-being as mandated by UN Member States, covering everything from HIV/AIDS cases to malaria, COVID, nutrition, injury, mortality, maternal health, mental health, immunization status, and tobacco use. Data collections are typically CC-BY licensed, but a wide variety of copyrighted information is included in these collections. See who.int.
  • IHME: The Institute for Health Metrics and Evaluation (IHME) works with collaborators around the world to develop evidence that sheds light on the state of global health and provides policy makers with accessible and usable information tools and resources. Over 500 people work at IHME. Founded in 2007, funding support comes from the University of Washington, the National Science Foundation, the Gates Foundation, and elsewhere. Much of what IHME collects and curates is free, but much is included with permission. See healthdata.org.
  • TAIR: The Arabidopsis Information Resource is a database of genetic and molecular biology data for Arabidopsis thaliana, a widely used model plant. Launched in 1999, TAIR offers the complete genome sequence along with gene structure, genome maps, genetic and physical markers, DNA and seed stocks, related publications, and information about the research community. TAIR is managed by the nonprofit Phoenix Bioinformatics Corporation (which manages other bioinformatics resources as well) and is supported primarily through institutional, lab and personal subscription revenues. See arabidopsis.org.
  • ALLEN BRAIN MAP: The Allen Institute for Brain Science was established in 2003 to accelerate neuroscience research worldwide by sharing large-scale, publicly available maps of the brain. Research teams conduct investigations into the inner workings of the brain; the institute also publicly shares all the data, products, and findings from its work on brain-map.org, including data, analysis tools, and lab resources. This information is copyright protected, not CC-BY licensed. See portal.brain-map.org.
  • UW DRUG INTERACTION DATABASE: The University of Washington’s Drug Interaction Database (DIDB) uses a small army of PhDs to read thousands of variously licensed peer-reviewed studies every year (as well as drug labels and NDA studies), and then manually extract qualitative and quantitative human and clinical information related to interacting medications, food products, herbals, genetics, and other factors that can affect drug exposure in humans. Launched in 2002, DIDB is today used by hundreds of pharmaceutical companies, regulatory agencies, CROs, academic institutions and clinical support organizations around the world. Sustainability is made possible through DIDB’s licensing and subscription revenues. See druginteractionsolutions.org.
  • USAFACTS.ORG: USAFacts curates US government data and provides users with a polished finished product. For example, search for climate databases on data.gov and you get a long list of downloadable documents plus links to climate-related state, federal and nonprofit websites. Click the climate tab on USAFacts and you get a long page of easy-to-read graphs and graphics plus clear text summations and links to relevant resources. The mission of USAFacts is to provide authoritative, easy-to-use, nonpartisan data to US citizens. The site is privately funded by former Microsoft CEO Steve Ballmer; it accepts no outside donations or funding. See usafacts.org.

Archived data isn’t typically useful right out of the box. In order to make data actually useful to other researchers, a great deal of work is often required, including but not limited to database design, data management, and data curation. Some of the most exciting data collaboration and sharing warehouses in the research world already focus on these services, but often only for data in certain niche fields, and/or only for members of certain research consortia. Phoenix is experienced with these challenges and has a long track record of success. Ask us how we can help.