The global biotechnology market was valued at $1.55 trillion in 2023 and is projected to grow at a rate of 13.96% annually until 2030. We regularly see headlines about start-ups in the space securing funding and what it will enable them to pursue as a company, but we rarely get a chance to zoom in on what the relationship between investor and start-up actually looks like. In episode 15 of Data in Biotech, we sat down with Jacob Oppenheim to find out more about how he works with start-ups as an Entrepreneur-in-Residence at Digitalis Ventures, a venture capital firm that invests in solutions to complex problems in human and animal health. He gives his take on the challenges biotech businesses face, particularly (as you would expect from us) around data.
Before joining Digitalis Ventures as Entrepreneur-in-Residence, Jacob held a variety of data science and machine learning leadership roles across the biopharma sector. This includes his role as VP of Data Science and Engineering at EQRx, which focused on developing more affordable novel medicines through efficient drug development and manufacturing. He holds a PhD in Biological Physics from Rockefeller University and an AB in Physics from Princeton University.
A big theme throughout our discussion with Jacob was how biotech companies go about establishing the strong data foundations they need for success. From the challenges of existing software to getting systems in place that allow them to scale, Jacob gave us a deep dive into the data challenges biotech start-ups face. Here are the highlights:
Involved Investors (5:17): Jacob explains one of the reasons for joining Digitalis Ventures was its reputation within its portfolio companies as an involved investor. He emphasized the value of a collaborative relationship and how the organization works with early-stage start-ups to help with the transformation from something that is potentially still quite academic to something that works in the real world. This involves helping biotech startups identify which targets and hypotheses are worthy of exploration so that they can focus limited resources on the problems that could be truly transformational for their business.
The Trade-Off (9:33): Jacob gives his take on the problems biotech start-ups face due to a lack of useful, usable, integrable tools. As a result, they face a trade-off between investing in developing tools and platforms early on and investing in validating their core scientific hypotheses. He believes this is a trade-off no start-up should have to make and explains how the industry, and those serving it, can address the problem.
A Unified Data Ecosystem (17:12): Jacob and Ross discuss the need for a more unified and comprehensive data ecosystem in biotech. Jacob explains that current tools and systems are not integrated and do not have proper interfaces with each other. The current ecosystem also lacks modularity, which makes it challenging for start-ups to put in place from the outset the tools they will need to scale. Instead, he believes biotech needs an "operating system" that can connect different tools and data sources and make data useful and accessible for scientists.
The Role of Consultancies (32:52): Jacob emphasizes the important role that consultancies play in the space. Biotech companies, particularly start-ups, may not have the in-house expertise or experience to build out technical teams. Consultancies can help set up pipelines, advise on the best systems to use, and provide expertise when needed. They also play a key role in building teams: not every founder or company will have the skills to identify and hire the right people. By offering external expertise where it is useful, consultancies allow tasks to move in-house as the team grows.
ML and Imaging (36:21): When Ross asked what industry developments Jacob was most excited about, Jacob spoke about the untapped potential for imaging to act as a unifying medium for machine learning models to reveal previously unexplored targets, treatments, and pathways in biological data. He says there is “relatively no limit on what you can do with imaging.” Because imaging can capture data from infrared as well as visible bands, there is a huge variety of signals that machine learning can help identify, meaning you are not limited to the signals you already know about. He is looking forward to advancements in this space, particularly the ability to scale computational algorithms.
This week in ‘Continuing the Conversation,’ let’s dig deeper into building the data foundations discussed in the episode. Jacob made no bones about the issues with software, tools, and resources and the challenges they create for start-ups.
At CorrDyn, we work with a number of companies across the biotech space, and one of the key considerations for many of our customers is total cost of ownership. As a result, we guide many of them towards cloud-based, open-source, and scalable/serverless tools as a way to implement powerful data handling capabilities that can be tailored to support their business goals without high upfront and ongoing costs. We’re always happy to have a conversation and help biotech companies on their data journey where we can (you can get in touch here), but here are some of the tools and frameworks worth considering when looking for reliable and cost-effective options.
Overall, regardless of the problems your organization solves and the tools it chooses to solve them, there are a few principles every biotech organization should apply to lay the foundation for its data strategy across data processing, computational analysis, and machine learning:
1. Ensure you retain ownership over your strategically important data:
I use the term “ownership” here to mean something more than just “the data is accessible to you via a software interface.” I define “ownership” to mean that the data is regularly replicated or exported to a place where it is stored for analysis and integration, preferably in the cloud.
Storage is cheap, and all software systems should have tools available to export data in an automated way. If the data is tabular, use an open-source data format that stores schema and metadata, such as Parquet or Iceberg. Use intelligent storage partitioning strategies so that data of different types and from different source systems lives in different places. If you produce a high volume of certain types of data, consider adding year/month/day to your file paths.
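As a minimal sketch of what this can look like in practice, here is a partitioned Parquet export in Python using pyarrow. The column names, values, and paths are hypothetical, and in production the dataset root would point at your cloud bucket rather than a local directory:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical assay results exported from a LIMS or instrument software.
df = pd.DataFrame(
    {
        "sample_id": ["S-001", "S-002", "S-003"],
        "assay": ["elisa", "elisa", "qpcr"],
        "value": [0.42, 0.57, 31.2],
        "year": [2024, 2024, 2024],
        "month": [5, 5, 6],
        "day": [14, 14, 2],
    }
)

# Write a partitioned Parquet dataset. In production, root_path would point at
# cloud object storage (e.g. "s3://<your-bucket>/assays") instead of a local path.
pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="data_lake/assays",
    partition_cols=["assay", "year", "month", "day"],
)
```

Each partition column becomes a directory level (for example, assay=elisa/year=2024/month=5/day=14), which keeps different data types and dates in separate paths and makes later filtering and pruning cheap.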
2. Utilize tools that decouple storage and compute to scale costs and performance as your company’s requirements change:
With your data landed in a storage environment (like AWS S3, Google Cloud Storage, or Azure Blob Storage), consistent data types stored in their own file paths, and relational data stored in high-quality open-source formats, the entire world of cloud-based data tools is open to you.
High-throughput aggregation and stream analysis can happen in a Spark-based platform like Databricks or in a serverless data warehouse like BigQuery, Snowflake, or MotherDuck. Batch processing frameworks like Nextflow can parallelize processing and scale up and down on cloud infrastructure according to the requirements of a given process, ensuring your company only pays for the computational power it needs. Ad hoc querying of tabular data can happen in Amazon Athena or in a hosted version of Trino. Every data pipeline can read from cloud storage and write back to storage. When the time comes to conduct machine learning exercises, all of the tools required to train, run inference, and deploy models are available at your fingertips.
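To illustrate the decoupling, here is a small sketch using DuckDB (the engine behind MotherDuck, mentioned above) to run an ad hoc aggregation directly against the partitioned Parquet files from the previous sketch. The paths and column names are the same hypothetical ones, and the S3 note at the end is an assumption about how you might point it at cloud storage:

```python
import duckdb

# Query the partitioned Parquet dataset without loading it into a database first:
# storage (Parquet files) and compute (the query engine) stay decoupled.
con = duckdb.connect()

result = con.sql(
    """
    SELECT assay, count(*) AS n_samples, avg(value) AS mean_value
    FROM read_parquet('data_lake/assays/**/*.parquet', hive_partitioning = true)
    GROUP BY assay
    """
).df()
print(result)

# Against cloud storage the pattern is the same; for S3 you would first run
# con.sql("INSTALL httpfs; LOAD httpfs;") and point read_parquet at
# 's3://<your-bucket>/assays/**/*.parquet'.
```

Because the data sits in open formats in object storage, you can swap this engine for Athena, Trino, BigQuery external tables, or a Spark cluster without rewriting the data itself.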
3. Ensure that all code is replicable and that infrastructure is also embodied as code:
Best practices in software and data science dictate that every project should have a consistent structure to the extent possible. Dependencies of the project are written in code using a dependency manager (e.g., Poetry in Python). Code linters are selected for each project and run against every commit using a pre-commit hook. Infrastructure is defined as code using a tool like Terraform (a minimal sketch of this pattern follows at the end of this section). Where possible, development and production environments are containerized (using a platform like Docker) to ensure that environments remain consistent as new developers are onboarded. Pull requests are regularly submitted and reviewed so that multiple members of the team understand the nature of the code that has been written. If deployment to cloud infrastructure is needed, continuous integration is triggered by merging a pull request (using, for example, GitHub Actions).
Your company’s code is part of its infrastructure and part of its documentation. With small teams, it is tempting to ignore these practices to move faster, but small teams want to become larger teams over time. With these best practices in place, you can ensure that the code is a company asset, not just an individual asset.
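To keep this article’s examples in one language, here is a minimal infrastructure-as-code sketch in Python using Pulumi rather than the Terraform mentioned above; it is the same pattern with a different tool. The resource names and tags are hypothetical, and the file runs via the Pulumi CLI (`pulumi up`) inside a Pulumi project rather than as a standalone script:

```python
"""__main__.py of a Pulumi project: declares the raw-data bucket as code."""
import pulumi
import pulumi_aws as aws

# Versioned bucket for raw exports from source systems.
# The name and tags are placeholders; align them with your own conventions.
raw_data = aws.s3.Bucket(
    "raw-assay-data",
    versioning=aws.s3.BucketVersioningArgs(enabled=True),
    tags={"team": "data", "purpose": "raw-exports"},
)

# Exported outputs show up in `pulumi stack output` and can feed CI pipelines.
pulumi.export("raw_data_bucket", raw_data.id)
```

Because the bucket is declared in code and lives in version control, it goes through the same pull-request review and continuous-integration flow as the rest of the codebase.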
This is by no means an exhaustive list, but it gives an idea of the practices and procedures data teams can put in place to help biotech organizations build the data foundation they need to scale. We’ll hear more about what this looks like in practice in an upcoming episode of Data in Biotech, when we are joined by Cambrium, a molecular design technology company that shares how being cloud native from day zero has allowed them to scale.
If you want even more from CorrDyn on data tools and platforms, you can subscribe to our Substack for more insights and articles from Data in Biotech host Ross Katz.
If you’d like to speak to CorrDyn about how to make the most of your in-house data and use it to support your business goals, get in touch for a free SWOT analysis.
Want to listen to the full podcast? Listen here: