Dichotomizing based on quantiles

Two versions were tested: the first uses median-based dichotomization, and the second uses breakpoints at the 25th and 75th percentiles, creating three partitions of which the middle partition is excluded from the survival analysis.
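The two quantile-based versions can be sketched as follows. This is a minimal illustration using NumPy, not the implementation used in the study. The function names, the example values, the `-1` code for excluded samples and the handling of ties at the breakpoints are all assumptions.

```python
import numpy as np

def dichotomize_median(expr):
    """Label a sample 1 if its expression is above the cohort median, else 0.

    Assigning ties at the median to group 0 is an assumption.
    """
    expr = np.asarray(expr, dtype=float)
    return (expr > np.median(expr)).astype(int)

def dichotomize_quartiles(expr):
    """Label 0 at or below the 25th percentile, 1 at or above the 75th.

    Samples in the middle partition get -1 (a hypothetical code) so they
    can be dropped before the survival analysis.
    """
    expr = np.asarray(expr, dtype=float)
    q25, q75 = np.percentile(expr, [25, 75])
    labels = np.full(expr.shape, -1, dtype=int)
    labels[expr <= q25] = 0
    labels[expr >= q75] = 1
    return labels

# Made-up expression values for eight patients.
expr = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
median_groups = dichotomize_median(expr)
quartile_groups = dichotomize_quartiles(expr)
keep = quartile_groups >= 0  # mask of samples retained in the quartile version
```

In the quartile version only the samples selected by `keep` would enter the survival comparison, which halves the sample size relative to the median version.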

k-means

k-means is a standard clustering method that partitions the data points into K groups, where K is a pre-specified number. There are different ways to implement k-means; in our implementation, each iteration of the algorithm assigns every patient sample to one of two clusters, choosing the cluster whose mean (centroid) minimizes the within-cluster sum of squares. The cluster centroids are then updated to reflect the changed memberships of the samples. The algorithm converges when the membership of samples in the two clusters is stable and no longer changes. We specify K=2 to dichotomize the continuous variable into two separate patient groups. Standard survival analysis is then run on the binary, transformed gene expression data, where the covariate represents the cluster membership identified by the k-means algorithm.
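The assign-then-update loop described above can be illustrated for a single gene with the following sketch. It is not the study's implementation: the function name, the initialization of the two centroids at the minimum and maximum expression values, and the `max_iter` cap are assumptions made for the example.

```python
import numpy as np

def kmeans_two_groups(expr, max_iter=100):
    """Split 1-D expression values into two clusters with Lloyd-style k-means.

    Initializing the centroids at the min and max values is an assumption.
    """
    expr = np.asarray(expr, dtype=float)
    centroids = np.array([expr.min(), expr.max()])
    labels = np.zeros(expr.shape, dtype=int)
    for iteration in range(max_iter):
        # Assignment step: each sample joins the cluster with the nearest
        # centroid, which minimizes the within-cluster sum of squares.
        distances = np.abs(expr[:, None] - centroids[None, :])
        new_labels = distances.argmin(axis=1)
        # Convergence: memberships did not change since the last iteration.
        if iteration > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for k in range(2):
            if np.any(labels == k):
                centroids[k] = expr[labels == k].mean()
    return labels, centroids

# Made-up expression values with two visibly separated groups.
expr = np.array([1.0, 1.2, 0.8, 5.0, 5.3, 4.9])
labels, centroids = kmeans_two_groups(expr)
```

The resulting `labels` vector is the binary covariate that would be passed to the survival analysis.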

Cox regression

A common method that does not require dichotomizing a variable a priori is Cox regression, one of the most widely used statistical methods for survival analysis. The model provides an estimate of the treatment effect on survival after adjustment for other explanatory variables. In addition, it allows estimation of the risk (or hazard) of death for an individual given their prognostic variables. The model is written as:

$$h(t) = h_0(t) \times \exp(b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)$$

where $h(t)$ is the hazard function that estimates the risk at any given time $t$ and is determined by a set of $n$ covariates $(x_1, x_2, \ldots, x_n)$. The regression coefficients $(b_1, b_2, \ldots, b_n)$ represent the proportional change in hazard due to the covariates. $h_0(t)$ is the baseline hazard function, which corresponds to the hazard when all covariates are equal to zero.
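To make the role of the coefficients concrete, the sketch below evaluates the exponential term of the model, which is the hazard relative to baseline, and the hazard ratio between two patients. The function names and all numeric values are hypothetical; in practice the coefficients would be estimated from data by a Cox regression fit, which this example does not perform.

```python
import numpy as np

def relative_hazard(coefficients, covariates):
    """Return exp(b1*x1 + ... + bn*xn), i.e. h(t) / h0(t)."""
    b = np.asarray(coefficients, dtype=float)
    x = np.asarray(covariates, dtype=float)
    return float(np.exp(np.dot(b, x)))

def hazard_ratio(coefficients, covariates_a, covariates_b):
    """Hazard of patient A relative to patient B; the baseline h0(t) cancels."""
    return (relative_hazard(coefficients, covariates_a)
            / relative_hazard(coefficients, covariates_b))

# Hypothetical coefficients for (gene expression, age); not fitted values.
b = [0.7, 0.02]
patient_a = [1.0, 60.0]  # one unit higher expression, same age
patient_b = [0.0, 60.0]
hr = hazard_ratio(b, patient_a, patient_b)  # equals exp(0.7)
```

Because the two patients differ only in the first covariate, the ratio reduces to the exponential of the first coefficient: a one-unit increase in that covariate multiplies the hazard by that factor at every time point.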

Distribution dichotomization method

The premise of this method is that genes may have different expression distributions in a patient cohort, and that testing for differences in patient survival time based on gene expression should therefore accommodate the shape or type of the distribution. For example, if a gene’s expression is symmetrically distributed, following a Normal distribution, then a natural choice is to compare the survival times of patients in the upper and lower tails of this distribution using a quantile cutoff. Alternatively, if the distribution is asymmetric, a more sensible comparison may be between the survival times of patients falling in the tail versus non-tail regions. Finally, if the gene’s distribution is not unimodal but bimodal, a much more natural comparison is between the patients in each mode of the distribution.

We used a computational scheme that assesses the most likely distribution of a gene’s expression profile by first considering bimodality through the Bimodal Index (BI) [26]. If BI > 1.1, the gene is designated bimodal and survival time is compared between the patients classified in one mode versus the other. If the gene is not bimodal, the expression distribution is simultaneously tested for fit to the Normal, Lognormal, Pareto, Gamma and Cauchy distributions, and the gene is assigned to the distribution with the most significant P-value. For genes with a Normal or Cauchy distribution, survival is tested for patients in the upper and lower tails versus patients in the non-tail region. For the Gamma, Pareto and Lognormal distributions, survival is compared between the tail and non-tail regions. If none of the tested distributions was significant, the gene was listed as having an unknown distribution and survival analysis was not performed.
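The decision scheme can be summarized in a short sketch. Several parts are assumptions rather than details taken from this section: the BI formula shown is the commonly cited definition for a two-component Normal mixture with a shared standard deviation, the significance level `alpha=0.05` is hypothetical, and the per-distribution P-values are taken as precomputed inputs because the test used to obtain them is not specified here.

```python
import math

BI_THRESHOLD = 1.1  # cutoff stated in the text
SYMMETRIC = {"Normal", "Cauchy"}
ASYMMETRIC = {"Gamma", "Pareto", "Lognormal"}

def bimodal_index(mu1, mu2, sigma, proportion):
    """BI = sqrt(p * (1 - p)) * |mu1 - mu2| / sigma.

    Assumes a two-component mixture with common standard deviation sigma
    and mixing proportion p; the mixture fit itself is not shown.
    """
    return math.sqrt(proportion * (1.0 - proportion)) * abs(mu1 - mu2) / sigma

def choose_comparison(bi, p_values, alpha=0.05):
    """Return (assigned distribution, survival comparison) for one gene.

    p_values maps distribution names to precomputed P-values; alpha is a
    hypothetical significance level.
    """
    if bi > BI_THRESHOLD:
        return "Bimodal", "mode 1 vs mode 2"
    significant = {name: p for name, p in p_values.items() if p < alpha}
    if not significant:
        return "Unknown", None  # no survival analysis is performed
    best = min(significant, key=significant.get)  # most significant P-value
    if best in SYMMETRIC:
        return best, "upper and lower tails vs non-tail"
    if best in ASYMMETRIC:
        return best, "tail vs non-tail"
    raise ValueError(f"unexpected distribution name: {best}")

# Made-up inputs for three genes.
bi_example = bimodal_index(mu1=0.0, mu2=3.0, sigma=1.0, proportion=0.5)
gene_a = choose_comparison(bi_example, {})
gene_b = choose_comparison(0.4, {"Normal": 0.01, "Gamma": 0.03})
gene_c = choose_comparison(0.4, {"Normal": 0.20, "Pareto": 0.60})
```

Here `gene_a` is routed to the mode-versus-mode comparison, `gene_b` to the two-tails-versus-middle comparison, and `gene_c` is left unassigned.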