TL;DR#
Gene therapy depends on delivering genetic cargo efficiently to target cells, and cell-type-specific promoters help limit expression in off-target cells. Existing approaches to promoter design are manual, data-intensive, or unable to distinguish closely related cell types. The paper therefore targets methods that need little data yet still produce effective designs, particularly for similar cell types.
The paper introduces a framework built on conservative model-based optimization (MBO), with an emphasis on data efficiency and on avoiding adversarial designs. The authors use conservative objective models (COMs) for MBO and address practical challenges such as sequence diversity and model uncertainty. The method was tested on three leukemia cell lines (Jurkat, K562, and THP1), and the designed sequences were validated experimentally. Cell-type specificity improved, most notably in K562 cells, where a novel promoter showed 75.85% higher cell-type specificity.
Key Takeaways#
Why does it matter?#
This paper is relevant to gene therapy researchers because it presents a data-efficient method for designing cell-type-specific promoters, whereas current methods are expensive and time-consuming. The approach opens a path to designing promoters for less-studied cell types and offers a workflow for applying model-based optimization in this field. It also connects to current work in personalized medicine and the development of more effective gene therapies.
Visual Insights#

🔼 This figure illustrates the five main steps involved in the proposed workflow for designing cell-type-specific promoters. It begins with pretraining a model on large genomic datasets and proceeds through iterative cycles of fine-tuning, design optimization, sequence selection, and experimental validation. This iterative approach aims to progressively improve promoter cell-type specificity.
Figure 1: Our workflow for designing cell-type-specific promoters. Five main steps are highlighted in grey: (1) pretrain a base model using existing large genomic datasets; (2) fine-tune the pretrained model using the experimentally measured PE data collected so far, while also using a conservative regularizer; (3) use a gradient ascent-based optimizer to design sequences that have high predicted DE; (4) apply a final sequence selection algorithm that balances optimality and diversity to choose a smaller subset of the designed sequences; (5) experimentally measure the PE of the selected designed sequences. The last four steps can be repeated when running multiple rounds of design iterations.
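Step (3) of the workflow can be illustrated with a minimal sketch. It assumes a toy scoring function in place of the fine-tuned model: `weights` is a hypothetical per-position reward table, and the sequence is relaxed to per-position softmax probabilities so that gradient ascent applies. The real predictor is a neural network whose gradients would come from automatic differentiation, so this shows only the shape of the optimization loop.

```python
import math

BASES = "ACGT"

def softmax(logits):
    """Numerically stable softmax over one position's four logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def design_by_gradient_ascent(weights, steps=200, lr=0.5):
    """Maximize the expected toy score by gradient ascent on per-position logits.

    weights[i][j] is a hypothetical reward for base j at position i
    (an assumption standing in for a trained predictor).
    """
    logits = [[0.0] * 4 for _ in weights]
    for _ in range(steps):
        for i, row in enumerate(weights):
            p = softmax(logits[i])
            expected = sum(pj * wj for pj, wj in zip(p, row))
            # d(expected score)/d(logit_j) = p_j * (w_j - expected)
            for j in range(4):
                logits[i][j] += lr * p[j] * (row[j] - expected)
    # Decode the relaxed sequence by taking the most probable base per position.
    return "".join(BASES[max(range(4), key=lambda j: l[j])] for l in logits)

toy_weights = [
    [0.1, 0.9, 0.2, 0.3],
    [0.7, 0.1, 0.1, 0.2],
    [0.0, 0.2, 0.8, 0.1],
]
designed = design_by_gradient_ascent(toy_weights)
```

With a real model the optimizer can drift toward sequences the model scores highly for the wrong reasons, which is why the workflow pairs this step with a conservative regularizer during fine-tuning.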

🔼 This table presents a quantitative analysis of the diversity of promoter sequences generated by different methods. It measures the diversity using two metrics: mean base pair entropy and mean Hamming distance. The mean base pair entropy reflects the diversity of nucleotide usage at each position across the designed sequences, with a maximum value of 2 indicating maximum diversity (uniform usage of all four bases). The mean Hamming distance measures the average pairwise differences between sequences, ranging from 0 (identical sequences) to 250 (maximum possible distance for 250 bp long sequences). Higher values indicate higher sequence diversity.
Table 1: Quantifying the diversity of designs. Mean base pair entropy (using binary logarithms, maximum value = 2) and mean pairwise Hamming distance (maximum value = 250) are shown for the designed sequences. Mean base pair entropy is calculated by determining the entropy of each position across all designs and then averaging these values.
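The two diversity metrics in the caption can be computed as in the following sketch. The function names and the toy sequences are illustrative assumptions, not taken from the paper; the definitions follow the caption (per-position entropy with binary logarithms averaged over positions, and Hamming distance averaged over all pairs).

```python
import math
from collections import Counter
from itertools import combinations

def mean_base_pair_entropy(seqs):
    """Average per-position Shannon entropy (log base 2) over equal-length sequences."""
    length = len(seqs[0])
    total = 0.0
    for pos in range(length):
        counts = Counter(seq[pos] for seq in seqs)
        n = sum(counts.values())
        total += -sum((c / n) * math.log2(c / n) for c in counts.values())
    return total / length

def mean_pairwise_hamming(seqs):
    """Average number of mismatching positions over all unordered pairs."""
    pairs = list(combinations(seqs, 2))
    return sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs) / len(pairs)

# Toy designs for illustration only.
designs = ["AAAA", "CCCC", "GGGG", "TTTT"]
entropy = mean_base_pair_entropy(designs)
distance = mean_pairwise_hamming(designs)
```

In this toy set every position uses all four bases equally, so the entropy reaches its maximum of 2, and every pair differs at all four positions, so the mean Hamming distance equals the sequence length.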
In-depth insights#
COMs for MBO#
Conservative Objective Models (COMs) offer a powerful approach to offline Model-Based Optimization (MBO) for designing biological sequences, particularly addressing the challenge of adversarial designs. Standard MBO methods, relying solely on predicted performance, can inadvertently generate sequences that optimize the model’s predictions but fail in reality. COMs mitigate this by incorporating a regularization term that penalizes sequences with unexpectedly high predicted performance outside the training data distribution. This constraint enforces a more conservative optimization, leading to designs that generalize better to real-world applications. The data efficiency of COMs is crucial in biological sequence design where experimental validation is costly and time-consuming, allowing for the development of highly specific sequences with fewer experimental trials. The combination of COMs with appropriate sequence selection strategies that account for both optimality and diversity ensures that the chosen candidates for experimental validation are highly promising. Therefore, COMs represent a significant advance in the data-efficient design of biological sequences, offering a reliable path towards more effective gene therapies and other biotechnology applications.
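The regularization idea can be sketched with a toy example. The code below assumes one common form of a conservative objective: a squared-error term plus a penalty, weighted by a hypothetical coefficient `alpha`, on how much higher the model scores points reached by gradient ascent than the training points they started from. The linear model, step size, and step count are illustrative assumptions, and the paper's exact loss may differ.

```python
def predict(w, x):
    """Toy linear objective model f(x) = w . x (stand-in for the real network)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def adversarial_input(w, x, step_size=0.1, steps=5):
    """Run gradient ascent on f from a training point; for a linear model df/dx = w."""
    x_adv = list(x)
    for _ in range(steps):
        x_adv = [xi + step_size * wi for xi, wi in zip(x_adv, w)]
    return x_adv

def com_loss(w, xs, ys, alpha=0.5):
    """Mean squared error plus alpha times the gap between predictions on
    ascent-derived inputs and predictions on the training inputs."""
    n = len(xs)
    mse = sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / n
    data_mean = sum(predict(w, x) for x in xs) / n
    adv_mean = sum(predict(w, adversarial_input(w, x)) for x in xs) / n
    return mse + alpha * (adv_mean - data_mean)

# Toy data that the model fits exactly, so only the penalty term is nonzero.
w = [1.0, 2.0]
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [1.0, 2.0]
loss = com_loss(w, xs, ys)
```

Minimizing this loss pushes the model to fit the measured data while lowering its predictions on inputs that an optimizer could reach by climbing the model, which is the sense in which the optimization becomes conservative.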
Data-Efficient Design#
Data-efficient design in the context of gene therapy aims to minimize the cost and time of discovering effective promoters. Traditional methods often require extensive experimental validation, which makes the process slow and expensive. The paper instead uses machine learning models trained on existing datasets to predict promoter activity. Researchers can then screen many promoter candidates in silico and send only the most promising ones to experimental validation. Conservative objective models (COMs) address adversarial designs by keeping predictions from being overly optimistic on sequences far from the training data. Ensemble methods that integrate multiple models account for model uncertainty and make candidate selection more reliable. This combination of model-based optimization, transfer learning, and COMs is a step toward more affordable and effective gene therapy development.
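One simple way to let an ensemble express uncertainty during in silico screening is to rank candidates by the ensemble mean minus a multiple of the ensemble spread. The sketch below assumes this lower-confidence-bound rule; the function name, the factor `k`, and the prediction values are hypothetical and are not taken from the paper.

```python
import statistics

def conservative_score(predictions, k=1.0):
    """Ensemble mean minus k standard deviations (a lower confidence bound)."""
    mean = statistics.fmean(predictions)
    spread = statistics.pstdev(predictions)
    return mean - k * spread

# Hypothetical predictions from three ensemble members per candidate.
ensemble_predictions = {
    "cand_A": [0.90, 0.88, 0.92],  # members agree
    "cand_B": [1.40, 0.20, 1.10],  # same mean, members disagree
}
ranked = sorted(
    ensemble_predictions,
    key=lambda name: conservative_score(ensemble_predictions[name]),
    reverse=True,
)
```

Both candidates have the same mean prediction, but the one on which the ensemble members agree ranks first, so experimental effort goes to the prediction the models are more confident about.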
Adversarial Mitigation#
In machine learning generally, adversarial mitigation means defending against inputs crafted to mislead a model. In promoter design the adversarial inputs are not malicious: they are sequences that the optimizer finds because they score highly under the model yet fail in reality. Conservative Objective Models (COMs) counter this by penalizing overconfident predictions on sequences unlike the training data, which discourages designs that exploit model weaknesses rather than reflect biological function. Because fewer of the selected candidates are false positives, fewer experimental validations are wasted, which matters when data efficiency is paramount. Mitigating adversarial designs leads to more reliable and biologically relevant promoter sequences, making gene therapy applications safer and more effective.
Diversity & Uncertainty#
Diversity and uncertainty both shape cell-type-specific promoter design. Diversity among generated sequences helps explore the vast sequence space and raises the likelihood of discovering novel, functional promoters absent from the initial datasets. Uncertainty in model predictions presents a challenge: because models are imperfect representations of biological reality, selecting candidates solely on high predicted scores can favor sequences whose scores are overestimated and that perform poorly experimentally. A balanced approach generates a diverse set of candidates and then makes a final selection that weighs predicted effectiveness (optimality) against model uncertainty. Metrics that account for sequence similarity and prediction confidence make experimental validation more efficient and the design process more robust. The ultimate goal is to identify sequences that function well experimentally, and a diverse pool of candidates chosen with these uncertainties in mind improves the chances of achieving it.
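The trade-off between optimality and diversity can be made concrete with a greedy selection sketch. It assumes each candidate already carries a score (for example, an uncertainty-adjusted prediction) and rewards distance from the sequences already chosen; the function names and the `diversity_weight` parameter are hypothetical, and the paper's actual selection algorithm may differ.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def select_diverse(candidates, k, diversity_weight=0.1):
    """Greedily pick k (sequence, score) pairs, each time maximizing
    score + diversity_weight * (minimum Hamming distance to the chosen set)."""
    remaining = list(candidates)
    chosen = []
    while remaining and len(chosen) < k:
        def gain(item):
            seq, score = item
            if not chosen:
                return score  # first pick is the top-scoring candidate
            return score + diversity_weight * min(hamming(seq, s) for s, _ in chosen)
        best = max(remaining, key=gain)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy candidates for illustration only.
candidates = [("AAAA", 1.0), ("AAAT", 0.95), ("CCCC", 0.8)]
picked = select_diverse(candidates, k=2)
```

With the diversity term active, the second pick is the dissimilar sequence rather than the near-duplicate of the first; setting `diversity_weight` to zero reduces the procedure to picking the top scores.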
Future Directions#
Future research could explore expanding the methodology to a wider range of cell types and disease contexts. Addressing the limitations encountered with THP1 cells by acquiring more extensive datasets or refining model architectures is crucial. Investigating alternative optimization algorithms beyond gradient ascent and DENs could enhance efficiency and diversity of promoter designs. Further exploration of the conservative objective model (COM) framework is warranted, potentially including investigations into different regularization strategies and the development of more robust model uncertainty estimation techniques. A key area for improvement lies in developing more sophisticated sequence selection algorithms that better balance optimality, diversity, and model uncertainty. Ultimately, the goal is to create a robust and scalable workflow capable of routinely generating highly effective cell-type-specific promoters for gene therapy applications, significantly impacting therapeutic development.
More visual insights#
More on figures

🔼 This figure compares the differential expression (DE) of promoter sequences designed using the authors’ method and motif tiling against their corresponding starting sequences. The top row shows the results for the authors’ method, illustrating the improvement in DE percentile scores for Jurkat and K562 cells, while THP1 shows less improvement. The bottom row shows that motif tiling is far less effective than the authors’ method. Each subplot represents a different cell line, and the red line indicates equal DE scores between starting and designed sequences. The numbers in the corners show how many designed sequences improved/worsened DE compared to the starting sequences.
Figure 2: Comparing the DE of designed sequences with the DE of corresponding starting sequences. The top row of plots compare the percentile scores of starting and designed sequences from our approach, and the bottom row of plots compare the percentile scores for starting and designed sequences from motif tiling. Each column represents sequences designed for one of the three cell types. The x = y line is shown in red, with the number of sequences above it highlighted in the top left corner and the number below it highlighted in the bottom right corner.

🔼 This figure compares the differential expression (DE) of sequences designed using gradient ascent and those designed using Deep Exploration Networks (DENs) across three leukemia cell lines (Jurkat, K562, and THP1). Box plots show the distribution of DE values for each cell line and optimization method. The Mann-Whitney-Wilcoxon (MWW) test assesses statistical significance, indicating whether gradient ascent produces consistently higher DE values than DENs for each cell type.
Figure 3: Box plots comparing the DE of sequences designed using gradient ascent as the optimizer in our workflow vs. those from DENs. p-values are computed using a Mann-Whitney-Wilcoxon (MWW) test that tests whether sequences from gradient ascent have higher DE than sequences from DENs.
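The one-sided test named in the caption can be sketched as follows. The code computes the U statistic directly and uses a normal approximation for the p-value without tie or continuity corrections, which is a simplification; the DE values are made up for illustration. In practice a library routine such as `scipy.stats.mannwhitneyu(a, b, alternative="greater")` would normally be used instead.

```python
import math

def mww_one_sided(a, b):
    """Return (U, p) for the alternative that values in a tend to exceed values in b.

    Uses the normal approximation to the null distribution of U,
    with no tie or continuity correction (a simplifying assumption).
    """
    # U counts pairs where the value from a beats the value from b; ties count half.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    n1, n2 = len(a), len(b)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return u, p

# Made-up DE values for two optimizers, for illustration only.
gradient_ascent_de = [5.0, 6.0, 7.0, 8.0]
den_de = [1.0, 2.0, 3.0, 4.0]
u_stat, p_value = mww_one_sided(gradient_ascent_de, den_de)
```

A small p-value supports the claim that the first group tends to have higher DE; swapping the argument order tests the opposite direction.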

🔼 This figure shows the architecture of the deep learning model used in the paper for predicting promoter-driven expression. The model consists of convolutional layers followed by a transformer layer, and finally multi-layer perceptron (MLP) layers for each cell type. The convolutional layers extract features from the DNA sequence, the transformer layer captures long-range dependencies, and the MLP layers predict expression levels for each of the cell types. Activation functions, dropout rate, and group normalization are also specified in the caption.
Figure S.1: Design model architecture: the convolutional layers use GELU activation [26], dropout with 0.1 probability [27], and are followed by a group norm layer [28] with each group size being 16. The MLP layers in the fine-tuning heads also use GELU activation. We apply the RoPE position embeddings [29] at each attention layer of the transformer.

More on tables

🔼 This table presents a quantitative analysis of the diversity of the designed promoter sequences generated by three different methods: the authors’ workflow, DENs (Deep Exploration Networks), and motif tiling. Diversity is assessed using two metrics: mean base pair entropy (measuring diversity across the sequence) and mean Hamming distance (measuring the overall dissimilarity between sequences). Higher values indicate greater diversity.
Table 1: Quantifying the diversity of designs. Mean base pair entropy (using binary logarithms, maximum value = 2) and mean pairwise Hamming distance (maximum value = 250) are shown for the designed sequences. Mean base pair entropy is calculated by determining the entropy of each position across all designs and then averaging these values.

