The Center for Data Innovation recently spoke with Oded Falik, CTO of Strand AI, a San Francisco-based company developing machine-learning systems that analyze relationships between biological measurements to help researchers recover information not directly present in existing datasets. Falik discussed how this approach allows pharmaceutical teams to fill critical data gaps, such as incomplete genetic, molecular, or tissue‑level information that slows drug development.
David Kertai: What problem is Strand AI solving?
Oded Falik: Most datasets used in pharmaceutical research lack key biological information needed to confidently develop new treatments. Researchers may have tissue images, blood samples, or drug‑response data, but often only for limited patient groups or without essential genetic or protein measurements. These gaps create a major bottleneck that makes it difficult to design effective therapies.
Strand AI addresses this challenge by using machine‑learning models that learn how different biological signals relate to one another. For instance, the models learn relationships between tissue images and patterns of gene or protein activity. After learning these connections, the models can infer what a missing measurement would likely show based on the biological signals researchers already have. By reconstructing a more complete biological picture, our models help pharmaceutical teams make faster, more confident decisions when developing new treatments and designing clinical trials.
Kertai: What data do your models rely on, and how do they make their predictions?
Falik: Our models learn from several types of biological data, including microscope images of tissue samples, measurements of gene and protein activity, genetic sequencing data, and clinical information. We train the models on datasets where researchers collected multiple types of measurements from the same patient or tissue sample. This structure allows the models to learn how those signals relate to one another.
For example, researchers may collect both microscope images and detailed protein measurements from the same region of tissue. After learning the relationship between those signals, the models can predict what the protein measurement would likely show using only the tissue image. This approach lets researchers recover valuable biological information without repeating complex laboratory tests.
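The cross-modal inference Falik describes can be illustrated with a minimal sketch. This is not Strand AI's actual method; it assumes hypothetical image-derived features and a simple linear map fit on paired samples, then used to infer the protein measurement for image-only samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired training data: each row pairs image-derived
# features with a protein measurement from the same tissue region.
n_paired, n_features = 200, 16
image_features = rng.normal(size=(n_paired, n_features))
true_weights = rng.normal(size=n_features)  # hidden biology (simulated)
protein_levels = image_features @ true_weights + rng.normal(scale=0.1, size=n_paired)

# Learn the image-to-protein relationship via least squares.
weights, *_ = np.linalg.lstsq(image_features, protein_levels, rcond=None)

# Infer the missing protein measurement for new image-only samples,
# avoiding a repeat of the laboratory assay.
new_images = rng.normal(size=(5, n_features))
inferred_protein = new_images @ weights
```

In practice the mapping from images to molecular measurements is learned by deep networks rather than a linear model, but the workflow is the same: fit on paired measurements, then predict the missing modality from the one in hand.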
Kertai: How do you ensure this data remains reliable and accurate?
Falik: We rely on two main safeguards. First, we train our models only on real biological measurements collected from the same patient or tissue sample. This ensures the models learn genuine biological relationships rather than patterns from synthetic or simulated data.
Second, we evaluate the models based on whether their predictions improve real decisions in drug development. We test whether the inferred measurement helps researchers identify patients who express a drug target, group patients more accurately for early clinical trials, or improve models that forecast treatment response. We compare performance with and without these inferred measurements. If they don’t improve those decisions, we don’t provide them to users.
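The with-versus-without evaluation Falik describes is essentially an ablation test. The sketch below is illustrative only, with invented numbers: a hidden protein level determines whether a patient expresses a drug target, and identifying such patients works markedly better with a model-inferred stand-in for that protein than with an uninformative baseline feature.

```python
import random

random.seed(0)

# Hypothetical cohort: a hidden protein level drives target expression.
patients = []
for _ in range(1000):
    protein = random.gauss(0.0, 1.0)             # true, unmeasured signal
    inferred = protein + random.gauss(0.0, 0.3)  # model-inferred stand-in
    baseline = random.gauss(0.0, 1.0)            # uninformative baseline feature
    expresses_target = protein > 0
    patients.append((baseline, inferred, expresses_target))

def accuracy(feature_index):
    """Classify by thresholding one feature at zero; return accuracy."""
    correct = sum((p[feature_index] > 0) == p[2] for p in patients)
    return correct / len(patients)

acc_without = accuracy(0)  # baseline feature only
acc_with = accuracy(1)     # plus the inferred measurement
```

If `acc_with` failed to beat `acc_without`, the inferred measurement would add no decision value, which is the criterion Falik says determines whether it is shipped to users.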
Kertai: How do you avoid potential hallucinations in your models?
Falik: We prevent hallucinations by grounding every prediction in real biological data. We test each model on measurements it never saw during training to confirm that its outputs match real biology. We also check whether predictions stay consistent across nearby regions of a tissue sample and across patients with similar diseases. Incorrect or fabricated predictions usually break those patterns.
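The spatial-consistency check Falik mentions can be sketched simply: predictions for adjacent tissue regions should vary smoothly, and an abrupt jump is a candidate hallucination. The function and threshold below are hypothetical, not Strand AI's implementation.

```python
# Hypothetical check: flag predictions that jump sharply between
# neighboring tissue regions, since fabricated outputs tend to
# break the smooth spatial patterns of real biology.
def flag_inconsistent(predictions, max_jump=0.5):
    """Return indices whose prediction differs sharply from the previous region."""
    return [
        i for i in range(1, len(predictions))
        if abs(predictions[i] - predictions[i - 1]) > max_jump
    ]

smooth = [0.10, 0.12, 0.15, 0.14, 0.18]  # plausible protein levels
spiky = [0.10, 0.12, 0.95, 0.14, 0.18]   # isolated spike at region 2
```

Here `flag_inconsistent(smooth)` returns no indices, while the spike in `spiky` is flagged for review rather than silently passed to users.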
Bias is another concern. Many biological datasets overrepresent certain populations or disease types. We track performance across different patient groups and select training data carefully. When we find gaps in representation, we add more diverse data or avoid using the model in settings where it may not perform effectively.
Kertai: How do you ensure explainability in your models' outputs for users?
Falik: Tissue‑based biological data gives us a natural advantage because the outputs are visual and easy to review. When the models infer where a protein appears in a tissue sample, researchers can view it as a map overlaid on the original image. A pathologist can examine the pattern just as they would a laboratory stain and quickly judge whether it matches known biology.
We also provide confidence scores alongside every output so users can see where the model is highly certain and where results are less reliable. Each model performs a single, clearly defined task, such as translating a tissue image into a protein‑activity map or estimating gene activity from genetic data, so researchers can easily understand what the model is doing and how to interpret its results. We believe this focused, well‑validated, and easy‑to‑interpret approach is the best way to bring AI into real clinical and pharmaceutical research environments.