RUBEN ZAMAR - University of British Columbia
Statistical detection of rare cases in highly unbalanced two-class situations is an interesting and challenging problem.
We are interested in detecting rare chemical compounds that are active against a biological target, such as lung cancer tumor cells, as part of a drug discovery process. Instead of predicting the class of each compound, we rank all of the compounds by their estimated probability of activity to produce a shortlist containing the maximum number of actives. We use four assay datasets and, for each assay, five descriptor sets that are rich in the number of variables. Capitalizing on this richness, we form phalanxes by grouping the variables of a descriptor set. Variables within a phalanx are good to model together, whereas variables in different phalanxes are good to ensemble. We then form our ensemble by growing a random forest in each phalanx and aggregating the forests over the phalanxes. The ensemble of phalanxes is found to perform better than its competitors, random forest and regularized random forest. Our ensemble performs very well when a descriptor set contains many variables and when the proportion of active compounds is very small.
In other words, the harder the problem is, the better the ensemble of phalanxes performs relative to alternative procedures.
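The aggregation and ranking steps described above can be illustrated with a minimal sketch. It assumes that each phalanx has already produced a probability of activity for every compound (for instance from its own random forest), and it aggregates by simple averaging, which is an assumption made here for illustration and not necessarily the aggregation rule used in the talk. The function names and the toy numbers are hypothetical.

```python
from statistics import mean


def aggregate_over_phalanxes(phalanx_probs):
    """Average per-compound activity probabilities across phalanxes.

    phalanx_probs: one list of probabilities per phalanx, all the same length.
    Averaging is an illustrative assumption, not the confirmed rule.
    """
    n_compounds = len(phalanx_probs[0])
    if any(len(p) != n_compounds for p in phalanx_probs):
        raise ValueError("every phalanx must score every compound")
    return [mean(column) for column in zip(*phalanx_probs)]


def shortlist(scores, size):
    """Return the indices of the `size` highest-scoring compounds."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:size]


def count_actives(selected, labels):
    """Count how many shortlisted compounds are truly active (label 1)."""
    return sum(labels[i] for i in selected)


# Made-up toy data: 8 compounds, 2 of them active, scored by 3 phalanxes.
labels = [0, 0, 1, 0, 0, 0, 1, 0]
phalanx_probs = [
    [0.10, 0.20, 0.70, 0.05, 0.30, 0.10, 0.45, 0.20],
    [0.05, 0.10, 0.60, 0.10, 0.20, 0.15, 0.80, 0.10],
    [0.20, 0.30, 0.50, 0.10, 0.60, 0.05, 0.60, 0.10],
]

scores = aggregate_over_phalanxes(phalanx_probs)
top = shortlist(scores, size=2)
n_hits = count_actives(top, labels)
print(top, n_hits)
```

The shortlist is judged by how many actives it contains, not by classification accuracy, which matches the ranking objective stated in the abstract. How the phalanxes themselves are formed is not covered by this sketch.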