Code for UAI paper: Nonparametric Bayesian Multi-Facet Clustering for Longitudinal Data
- Author: Luwei Wang
We implemented two example time series models for multi-facet clustering:
- Nonlinear growth model (single-dimensional, B-spline-based,
MFNLG, allows missing values)- Facets: intercept at time T; shape; noise
- Vector autoregressive model (multi-dimensional, one lag,
MFVAR, allows varying time series lengths)- Facets: intercept vector; transition matrix (interaction); noise vector
.MFNLG()and.MFVAR(): create an initial model with data information and specific time series model parameters..initialize(): initialize model parameters and change hyperparameters of priors through a dictionary..fit(): start fitting by mean-field Variational Inference.- We recommend running the model fitting through multiple parallel runs to avoid getting stuck in the bad local optima.
prune_thresholdshould not be greater than1/trunc_level.trunc_levelcontrols the maximum possible number of clusters and also affects the training speed.prune_thresholdaffects the learned number of clusters, and can be specified for each facet separately.
- The fitted model only returns important hyperparameters, ELBOs and training time. If you want to get the estimated parameters, call
.getEstimates(). If you want to get the estimated posterior probability matrix for each individual, call.getPiMat(). - There are built-in print and plot functions to print fitting results (with ground truth if provided)
.printRes(), visualise facets.plotFacet()and show facet cluster assignments.showClust(). You can also define your own print and plot functions. .save(): save the model
import numpy as np
from NPBayesMFVI import MFNLG, MFVAR
from joblib import Parallel, delayed
def runParal(s): # for parallel runs
model = MFVAR(data=data, trunc_level = 10, seed=s)
# model = MFNLG(data=data, trunc_level = 10, inter_knots=int_knots, seed=s)
model.initialize()
model.fit(maxIter = 300, prune_threshold=0.01)
return model
data = (numpy array in N*T or N*D*T) # for MFNLG or MFVAR model
seeds = [] # generate random seeds to use for parallel runs
num_cores = 50
paral_res = Parallel(n_jobs=num_cores)(delayed(runParal)(s) for s in seeds)
# choose the highest ELBO
ELBOs = [(i,np.round(paral_res[i].ELBO_iters[-1],2)) for i in range(len(paral_res))]
print('Optimal ELBOs of diff models:\n', ELBOs)
model_idx = np.nanargmax(ELBOs, axis=0)[1]
model = paral_res[model_idx]
print('Best model at index:', model_idx, "with ELBO:", np.round(model.ELBO_iters[-1],2),"; Model seed:", model.seed)
model.printRes() # print estimated results and sd; can also print ground truth if provided (a,pi_a,beta/Beta,pi_beta/pi_Beta,sigma,pi_sigma)
model.plotFacet()
model.showClust() # plot contingency table of cluster assignments for intercepts and coefficients
model.getEstimates() # get all estimates for customized analysis and plots
model.save(filePath)