
BMC Bioinformatics 2008, 9: http://www.biomedcentral.com/1471-2105/9/

... only for data from the normal distribution. For considerably non-normal data the choice of dimensionality may not be optimal, but neither is the method itself. It is therefore advisable to transform the data so that they roughly follow a normal distribution, for example by taking the logarithm of gene expression values.

Implementation

We have implemented the method, including the choice of dimensionality and the validation measures presented in the section Validation measures, as an open-source package for R [see Additional file 1].

Experiments

Validation on gene expression data

We first validate the method on three gene expression data sets (described in Section Methods) by checking how well it preserves the shared variation in the data sets and discards the data-specific variation.

The two-step procedure described in the Algorithm subsection is applied to the training data to compute the eigenvectors $V^t$ and the whitening matrix $W^t$, where $W^t$ is a block-diagonal matrix containing the whitening matrices for each matrix in the training data. The fused representation for the validation data is computed as

$$P_d^v = X^v W^t V_d^t,$$

where $X^v$ is the columnwise concatenation of the validation data matrices, and the superscripts $t$ and $v$ mark quantities belonging to the training and validation data, respectively. The variance in the fused representation is now our estimate of the shared variance. We average the estimate over 3 different splits into training and validation sets. To compute the shared variance under the null hypothesis, random data sets are created from the multivariate normal distribution with a diagonal covariance matrix whose diagonal values equal the columnwise variances of $X_i^t$. The shared variance for the random data is computed in the same way as described above. We repeat the process for 50 randomly created data sets. The shared variance in the original data is then compared to the distribution of shared variances under the null hypothesis, starting from the first dimension. When the ...

In the case of two data sets, an estimate of mutual information can be computed directly from the canonical correlations as

$$I(X_1 U_{1,d}, X_2 U_{2,d}) = -\frac{1}{2} \sum_{i=1}^{d} \log\left(1 - \rho_i^2\right),$$

where $\rho_i$ is the $i$-th canonical correlation, based on the assumption of normally distributed data. Consequently we started by confining ourselves to pairs of data sources. Figure 1 shows the results for one of the pairs in each collection; the rest are analogous. It is evident that the method retains the shared variation between data sets and that the shared variation increases with an increasing number of dimensions in the combined data. For more than two variables, the measures explained in the Methods Section are used.

Figure 1: Mutual information. Mutual information for two data sets as a function of the reduced dimensionality (panels: Leukemia Data, Cell Cycle Data, Stress Data; x-axis: Number of Dimensions; y-axis: Mutual information). Each subgraph represents the mutual information curve for two data sets from the corresponding data collection. The curves for other pairs in each data collection show a similar pattern.

We compare the results with PCA of the concatenated data matrices. PCA is equally fast, linear, and unsupervised. Note that the proposed CCA-based method is also unsupervised, as no class information is used. Furthermore, since both methods have a global optimum, differences in performance cannot be due to optimization issues. The only difference then is related to the main topic of this paper: whether to model all information in the whol.
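The mutual information estimate from canonical correlations described above can be illustrated with a minimal sketch. It assumes the Gaussian formula (minus one half of the sum of log(1 - rho_i^2) over the first d canonical correlations). The function name `gaussian_mutual_information` and the example correlations are hypothetical and are not taken from the authors' R package. The natural logarithm is used, so the result is in nats.

```python
import math


def gaussian_mutual_information(canonical_correlations, d):
    """Return -1/2 * sum_{i=1..d} log(1 - rho_i^2) in nats.

    Assumes normally distributed data and canonical correlations
    strictly inside (-1, 1); a correlation of exactly +/-1 would
    make the estimate infinite.
    """
    rho = list(canonical_correlations)[:d]
    if any(abs(r) >= 1.0 for r in rho):
        raise ValueError("canonical correlations must lie strictly inside (-1, 1)")
    # log1p(-r^2) equals log(1 - r^2) but is more accurate for small r.
    return -0.5 * sum(math.log1p(-r * r) for r in rho)


# Hypothetical canonical correlations, sorted in decreasing order.
example_rho = [0.9, 0.5, 0.1]

# Mutual information as a function of the reduced dimensionality d.
mi_curve = [
    gaussian_mutual_information(example_rho, d)
    for d in range(1, len(example_rho) + 1)
]
```

Every term in the sum is non-negative, so a curve computed this way cannot decrease as dimensions are added. This agrees with the increase with dimensionality reported for Figure 1.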
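The null-hypothesis data sets described above (independent normal columns whose variances match the columnwise variances of the training data, repeated 50 times) can be sketched as follows. This is only an illustration under stated assumptions: zero-mean columns, population variance, and matrices stored as lists of rows. The names `null_dataset`, `null_distribution`, and `total_variance` are hypothetical. The statistic passed in is a placeholder for the shared-variance estimate, which would come from the two-step procedure and is not reproduced here.

```python
import random
import statistics


def null_dataset(data, rng):
    """Draw a random matrix with independent, zero-mean normal columns.

    Each column keeps the standard deviation of the matching column in
    `data`, so the generating covariance matrix is diagonal.
    """
    n_rows = len(data)
    column_sds = [statistics.pstdev(col) for col in zip(*data)]
    return [[rng.gauss(0.0, sd) for sd in column_sds] for _ in range(n_rows)]


def null_distribution(datasets, statistic, n_repeats=50, seed=0):
    """Evaluate `statistic` on `n_repeats` collections of null data sets."""
    rng = random.Random(seed)
    return [
        statistic([null_dataset(X, rng) for X in datasets])
        for _ in range(n_repeats)
    ]


def total_variance(datasets):
    """Placeholder statistic (an assumption, NOT the shared-variance estimate)."""
    return sum(statistics.pvariance(col) for X in datasets for col in zip(*X))


# Toy matrices standing in for two training data sets (rows are samples).
X1 = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0], [3.0, 5.0]]
X2 = [[2.0, 0.5], [0.0, 1.5], [1.0, 1.0], [4.0, 0.0]]

null_values = null_distribution([X1, X2], total_variance, n_repeats=50)
```

The statistic computed on the original data would then be compared against `null_values`. The rule for where to stop is cut off in the text above, so it is not implemented here.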
