## References

**Citekey**: @Xiong15092011

Xiong HY, Barash Y and Frey BJ (2011). “Bayesian prediction of
tissue-regulated splicing using RNA sequence and cellular context.”
*Bioinformatics*, *27*(18), pp. 2554-2562.
<http://dx.doi.org/10.1093/bioinformatics/btr444>,
<http://bioinformatics.oxfordjournals.org/content/27/18/2554.full.pdf+html>,
<http://bioinformatics.oxfordjournals.org/content/27/18/2554.abstract>.

## Notes

The Bayesian neural network technique reported in this article is one of the recent advances in neural networks, an area that has gained momentum in recent years. It would be interesting to test it in an educational setting and compare its performance with techniques widely used in educational data mining, such as SVMs.

## Highlights

An important goal in understanding this phenomenon is to infer a ‘splicing code’ that predicts how splicing is regulated in different cell types by features derived from RNA, DNA and epigenetic modifiers. (p. 2554)

Methods: We formulate the assembly of a splicing code as a problem of statistical inference and introduce a Bayesian method that uses an adaptively selected number of hidden variables to combine subgroups of features into a network, allows different tissues to share feature subgroups and uses a Gibbs sampler to hedge predictions and ascertain the statistical significance of identified features. (p. 2554)

Taking a predictive modeling approach, we seek a network model that can be applied to a comprehensive set of RNA sequences to make accurate probabilistic predictions for how splicing will occur under different cellular conditions, using features derived from RNA, DNA, epigenetic modifiers, etc. (p. 2555)

2 LEARNING REGULATORY MODELS (p. 2555)

The set of RNA sequences was obtained by mining EST and cDNA libraries and identifying 3665 cassette-type mouse exons, i.e. exons that are sometimes included and sometimes excluded in the spliced transcripts extracted from tissues and cell lines. The fraction of isoforms including each cassette exon was profiled across 27 mouse tissues. (p. 2555)

We are not really interested in exactly how only a single transcript is spliced at a particular point in time within a particular cell. This knowledge would neither provide an understanding of how splicing works nor enable us to predict what would happen under different conditions. A more compelling scientific goal is to infer a network model that summarizes related splicing patterns, accounts for variability in cellular conditions and how they influence splicing (p. 2555)

Measuring code quality using relative log-likelihood (p. 2556)

The feature vectors form a 3665×1014 matrix, with rows corresponding to exons and columns corresponding to features, and the tissue-dependent targets form a 3665×12 matrix, with each row containing qinc, qexc and qnc for each of the four tissue types. (p. 2556)
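To make the shapes in the highlight above concrete, here is a minimal Python sketch of how one 12-wide target row could be laid out: four tissue types times three values (qinc, qexc, qnc). The helper names, the example numbers, and the check that each triple sums to 1 are assumptions made for illustration; the highlight itself only gives the matrix dimensions.

```python
# Dimensions quoted in the highlight.
N_EXONS, N_FEATURES = 3665, 1014
N_TISSUE_TYPES = 4
OUTCOMES = ("inc", "exc", "nc")  # one q value per outcome, per tissue type


def flatten_target_row(per_tissue):
    """Flatten a list of (q_inc, q_exc, q_nc) triples into one target row.

    Assumption: each triple is treated as a probability distribution,
    so it must sum to 1.
    """
    row = []
    for triple in per_tissue:
        if len(triple) != len(OUTCOMES):
            raise ValueError("expected one value per outcome")
        if abs(sum(triple) - 1.0) > 1e-9:
            raise ValueError("each tissue's q values should sum to 1")
        row.extend(triple)
    return row


def unflatten_target_row(row):
    """Inverse of flatten_target_row: regroup a flat row into triples."""
    k = len(OUTCOMES)
    return [tuple(row[i:i + k]) for i in range(0, len(row), k)]


# Hypothetical targets for a single exon (made-up numbers).
example = [(0.7, 0.1, 0.2), (0.2, 0.2, 0.6), (0.1, 0.8, 0.1), (0.3, 0.3, 0.4)]
row = flatten_target_row(example)
print(len(row))  # width of one row of the target matrix
```

Stacking one such row per exon would give the 3665×12 target matrix, alongside a 3665×1014 feature matrix with one row per exon.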

Training sets and test sets: (p. 2556)

The number of features is large relative to the number of examples, so methods that are not regularized will likely overfit the data. (p. 2556)

Care needs to be taken to avoid overfitting the model in such a way that generalization is not possible. (p. 2556)

3 ALGORITHMS (p. 2556)

3.1 Bayesian Neural Network (p. 2556)

To address these issues, we take a Bayesian approach (Bishop, 2006; MacKay, 1992; Neal, 1996) where the computational task is to sample from a posterior distribution over models. Predictions for novel test cases are made by averaging the predictions from the sample of models, and important features can be identified by checking to see if they are used more frequently than expected at random under the prior distribution of models. (p. 2557)
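The two ideas in this highlight, averaging predictions over sampled models and checking how often a feature is used, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampled "models" are stand-in callables with made-up outputs, the prior inclusion rate is a hypothetical constant, and the simple comparison against that rate replaces the statistical test a real analysis would need. It is not the paper's sampler or network.

```python
def average_predictions(sampled_models, x):
    """Average outcome probabilities over a sample of models."""
    predictions = [model(x) for model in sampled_models]
    n_models = len(predictions)
    n_outcomes = len(predictions[0])
    return [sum(p[k] for p in predictions) / n_models for k in range(n_outcomes)]


def feature_usage(sampled_feature_sets, n_features):
    """Fraction of sampled models that use each feature."""
    counts = [0] * n_features
    for used in sampled_feature_sets:
        for j in used:
            counts[j] += 1
    return [c / len(sampled_feature_sets) for c in counts]


def used_more_than_prior(usage, prior_inclusion):
    """Indices of features used more often than the assumed prior rate.

    A crude screen only; it does not assess statistical significance.
    """
    return [j for j, u in enumerate(usage) if u > prior_inclusion]


# Three hypothetical posterior samples: each returns outcome probabilities
# and is associated with the set of feature indices it uses.
models = [
    lambda x: [0.6, 0.3, 0.1],
    lambda x: [0.8, 0.1, 0.1],
    lambda x: [0.7, 0.2, 0.1],
]
feature_sets = [{0, 2}, {0}, {0, 1}]

avg = average_predictions(models, x=None)
usage = feature_usage(feature_sets, n_features=3)
flagged = used_more_than_prior(usage, prior_inclusion=0.5)
print(avg, usage, flagged)
```

Averaging over samples is what lets the prediction hedge: when the sampled models disagree, the averaged probabilities move away from 0 and 1.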

3.2 Other methods included in the benchmark (p. 2557)

4 RESULTS (p. 2558)

Comparisons of test code quality: All methods were evaluated using held out test data as described above and the test code quality for several methods is plotted in Figure 1 (see below for further details). (p. 2558)

The Bayesian method achieved a relative improvement of 52% over the previously published result obtained using boosting (Barash et al., 2010a), and significantly outperforms the other methods. (p. 2558)

Comparisons of classification accuracy: The log-likelihood based measure of code quality described above takes into account how accurately each method can assess its confidence in its predictions. A different, but related task is to apply a threshold to each method’s prediction probabilities and measure classification accuracy. (p. 2558)
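The difference between the two evaluation tasks can be seen in a small Python sketch. It assumes a binary outcome, a threshold of 0.5, made-up predictions for two hypothetical methods, and a plain mean log-likelihood standing in for the paper's relative log-likelihood measure, which is not reproduced here.

```python
import math


def mean_log_likelihood(p_positive, labels, eps=1e-12):
    """Mean log-probability assigned to the true labels (binary case)."""
    total = 0.0
    for p, y in zip(p_positive, labels):
        p_true = p if y == 1 else 1.0 - p
        total += math.log(max(p_true, eps))  # clamp to avoid log(0)
    return total / len(labels)


def thresholded_accuracy(p_positive, labels, threshold=0.5):
    """Accuracy after turning probabilities into hard 0/1 decisions."""
    decisions = [1 if p >= threshold else 0 for p in p_positive]
    correct = sum(1 for d, y in zip(decisions, labels) if d == y)
    return correct / len(labels)


labels = [1, 1, 0, 0]
method_a = [0.9, 0.8, 0.2, 0.6]    # mildly wrong on the last case
method_b = [0.6, 0.55, 0.45, 0.95]  # confidently wrong on the last case

acc_a = thresholded_accuracy(method_a, labels)
acc_b = thresholded_accuracy(method_b, labels)
ll_a = mean_log_likelihood(method_a, labels)
ll_b = mean_log_likelihood(method_b, labels)
print(acc_a, acc_b, ll_a, ll_b)
```

Both hypothetical methods make the same hard decisions, so their thresholded accuracy is identical, but the log-likelihood penalizes the second one for being confident where it is wrong.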