RegDiffusionTrainer#

class regdiffusion.RegDiffusionTrainer(exp_array, cell_types=None, T=5000, start_noise=0.0001, end_noise=0.02, time_dim=64, celltype_dim=4, hidden_dims=[16, 16, 16], init_coef=5, lr_nn=0.001, lr_adj=None, weight_decay_nn=0.1, weight_decay_adj=0.01, sparse_loss_coef=0.25, adj_dropout=0.3, batch_size=128, n_steps=1000, train_split=1.0, train_split_seed=123, device='cuda', compile=False, evaluator=None, eval_on_n_steps=100, logger=None)[source]#

Initialize and train a RegDiffusion model.

For architecture and training details, please refer to our paper.

> From noise to knowledge: probabilistic diffusion-based neural inference

You can access the model through RegDiffusionTrainer.model. A minimal usage sketch follows the parameter list below.

Parameters:
  • exp_array (np.ndarray) – 2D numpy array. If used on single-cell RNA-seq data, the rows are cells and the columns are genes. Data should be log transformed. You may also want to remove all non-expressed genes.

  • cell_types (np.ndarray) – (Optional) 1D integer array of cell types. If your cell types are labeled with strings, you need to convert them to integers first. Default is None.

  • T (int) – Total number of diffusion steps. Default: 5,000

  • start_noise (float) – Minimal noise level (beta) to be added. Default: 0.0001

  • end_noise (float) – Maximal noise level (beta) to be added. Default: 0.02

  • time_dim (int) – Dimension size for the time embedding. Default: 64.

  • celltype_dim (int) – Dimension size for the cell type embedding. Default: 4.

  • hidden_dims (list) – Dimension sizes for the feature learning layers. We use the size of the first layer as the dimension for gene embeddings as well. Default: [16, 16, 16].

  • init_coef (int) – A coefficient controlling the initial values of the adjacency matrix. Here we define the regulatory norm as 1 over (number of genes - 1). The adjacency matrix is initialized to init_coef times the regulatory norm. Default: 5.

  • lr_nn (float) – Learning rate for the rest of the neural networks except the adjacency matrix. Default: 0.001

  • lr_adj (float) – Learning rate for the adjacency matrix. By default, it is 0.02 times the regulatory norm, where the regulatory norm is 1/(n_gene - 1).

  • weight_decay_nn (float) – L2 regularization coef on the rest of the neural networks. Default: 0.1.

  • weight_decay_adj (float) – L2 regularization coef on the adj matrix. Default: 0.01.

  • sparse_loss_coef (float) – L1 regularization coef on the adj matrix. Default: 0.25.

  • adj_dropout (float) – Probability of an edge to be zeroed. Default: 0.3.

  • batch_size (int) – Batch size for training. Default: 128.

  • n_steps (int) – Total number of training iterations. Default: 1000.

  • train_split (float) – Fraction of data assigned to the train partition. Default: 1.0 (no validation split).

  • train_split_seed (int) – Random seed for train/val partition. Default: 123

  • device (str or torch.device) – Device where the model runs, for example “cpu”, “cuda”, or “cuda:1”. Running this model on Apple’s MPS chips is not recommended. Default is “cuda”, but if only a CPU is available, it will fall back to CPU.

  • compile (boolean) – Whether to compile the model before training. Compiling the model is a good idea on large datasets and often improves inference speed when it works. For smaller datasets, eager execution is often good enough. Default: False.

  • evaluator (GRNEvaluator) – (Optional) A defined GRNEvaluator if ground truth data is available. Evaluation will be run every 100 steps by default, but you can change this through the eval_on_n_steps option. Default is None.

  • eval_on_n_steps (int) – If an evaluator is provided, the trainer will run evaluation every eval_on_n_steps steps. Default: 100.

  • logger (LightLogger) – (Optional) A LightLogger to log training process. The only situation when you need to provide this is when you want to save logs from different trainers into the same logger. Default is None.
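A minimal usage sketch is shown below. The expression array and gene names here are synthetic placeholders (any log-transformed cells-by-genes matrix works), and the alias rd for the regdiffusion package is an assumption.

    import numpy as np
    import regdiffusion as rd

    # Placeholder log-transformed expression matrix: rows are cells,
    # columns are genes. Replace with your own preprocessed data.
    exp_array = np.log1p(
        np.random.default_rng(0).poisson(1.0, size=(1000, 200))
    ).astype(np.float32)
    gene_names = np.array([f"gene_{i}" for i in range(exp_array.shape[1])])

    trainer = rd.RegDiffusionTrainer(exp_array, device="cuda")
    trainer.train()
    grn = trainer.get_grn(gene_names, top_gene_percentile=50)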

forward_pass(x_0, t)[source]#

Forward diffusion process. An illustrative sketch follows the parameter list.

Parameters:
  • x_0 (torch.FloatTensor) – Torch tensor for expression data. Rows are cells and columns are genes.

  • t (torch.LongTensor) – Torch tensor for diffusion time steps.
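For reference, the closed-form forward diffusion used in standard denoising diffusion models samples x_t directly from x_0. The sketch below is illustrative only: it assumes a linear beta schedule from start_noise to end_noise, which matches the parameter descriptions above but is not guaranteed to be the library's exact implementation, and the actual return values of forward_pass may differ.

    import torch

    # Illustrative standard-DDPM forward process under an assumed
    # linear beta schedule; not the library's verified implementation.
    T, start_noise, end_noise = 5000, 1e-4, 0.02
    betas = torch.linspace(start_noise, end_noise, T)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    def forward_diffusion_sketch(x_0, t):
        # x_0: (n_cells, n_genes) expression tensor; t: (n_cells,) long tensor
        noise = torch.randn_like(x_0)
        a_bar = alpha_bars[t].unsqueeze(-1)  # shape: (n_cells, 1)
        x_t = torch.sqrt(a_bar) * x_0 + torch.sqrt(1.0 - a_bar) * noise
        return x_t, noise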

get_adj()[source]#

Obtain the adjacency matrix. The values in this adjacency matrix have been scaled using the regulatory norm. You may expect strong links to go beyond 5 or 10 in most cases.
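A short usage sketch, assuming get_adj returns an (n_genes, n_genes) NumPy array; the cutoff of 5 is only the rough rule of thumb from the note above.

    adj = trainer.get_adj()  # assumed (n_genes, n_genes) NumPy array
    strong_links = adj > 5   # rough cutoff based on the note above
    print(f"{strong_links.sum()} candidate strong links")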

get_grn(gene_names, tf_names=None, top_gene_percentile=None)[source]#

Obtain a GRN object. You need to provide the gene names.

Parameters:
  • gene_names (np.ndarray) – An array of names of all genes. The order of genes should be the same as the order used in your expression table.

  • tf_names (np.ndarray) – (Optional) An array of names of all transcription factors. The order of genes should be the same as the order used in your expression table.

  • top_gene_percentile (int) – If provided, the values on weak links will be set to zero. This is useful if you want to save the regulatory relationships in a GRN object as a sparse matrix. A usage sketch follows this list.
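A hedged usage sketch, continuing from the trainer and gene_names created earlier; the TF list below is a made-up placeholder.

    # gene_names must follow the column order of exp_array
    grn = trainer.get_grn(gene_names, top_gene_percentile=50)

    # Optionally restrict candidate regulators to a (hypothetical) TF list
    tf_names = np.array(["gene_0", "gene_3", "gene_7"])
    grn_tf = trainer.get_grn(gene_names, tf_names=tf_names, top_gene_percentile=50)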

train(n_steps=None)[source]#

Train the initialized model for a number of steps.

Parameters:

  • n_steps (int) – Number of steps to train. If not provided, it will train the model for the n_steps specified at class initialization. Please read our paper to see how to identify the convergence point. A short sketch follows.
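If the adjacency matrix has not converged after the initial run, training can continue for more steps, as sketched below; that repeated calls resume from the current state is an assumption consistent with the description above.

    trainer.train()             # runs for the n_steps set at initialization
    trainer.training_curves()   # inspect adj_change for convergence
    trainer.train(n_steps=500)  # assumed to continue from the current state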

training_curves()[source]#

Plot the training curves for train_loss and adj_change. Check out our paper for how to use adj_change to identify the convergence point.