The present invention provides a powerful and robust classification and
prediction tool, methodology, and architecture for supervised learning,
particularly applicable to complex datasets where multiple factors
determine an outcome and yet many other factors are irrelevant to
prediction. The features that are relevant to the outcome often have
complicated and influential interactions, even though their individual
contributions are insignificant. For example, polygenic diseases may be
associated with genetic and environmental risk factors. This new approach
allows us to consider all risk factors simultaneously, including interactions
and combined effects. Our approach has the strengths of both binary
classification trees and regression. A simple rooted binary tree model is
created with each split defined by a linear combination of selected
variables. The linear combination is achieved by regression with optimal
scoring. The variables are selected using backward shaving.
Cross-validation is used to find the level of shrinkage that minimizes
errors. Using a selected variable subset to define each split not only
increases interpretability, but also enhances the model's predictive
power and robustness. The final model deals with cumulative effects and
interactions simultaneously.
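
The construction of a single split described above can be illustrated with a minimal sketch. It is not the implementation of the invention; it rests on several assumptions that are not stated in the text. It assumes a two-class outcome coded 0/1, for which regression with optimal scoring reduces to least-squares regression of the class indicator on the selected variables. It assumes that backward shaving drops a fixed fraction of the variables with the smallest standardized coefficients at each step. It also assumes that cross-validation chooses among the resulting nested subsets. The function names (fit_linear_split, predict_split, shaving_path, cv_select) and the parameters frac, n_folds, and seed are hypothetical. A full tree would apply the same procedure recursively to the two child nodes.

```python
import numpy as np

def fit_linear_split(X, y, cols):
    """Fit one split: least-squares regression of the centered 0/1 outcome
    on the selected columns (assumed two-class form of optimal scoring)."""
    Xs = X[:, cols]
    mu = Xs.mean(axis=0)
    w, *_ = np.linalg.lstsq(Xs - mu, y - y.mean(), rcond=None)
    scores = (Xs - mu) @ w
    # Threshold at the midpoint between the two class mean scores.
    thr = 0.5 * (scores[y == 1].mean() + scores[y == 0].mean())
    return mu, w, thr

def predict_split(X, cols, mu, w, thr):
    """Send each sample to the right (1) or left (0) child of the split."""
    return ((X[:, cols] - mu) @ w > thr).astype(int)

def shaving_path(X, y, frac=0.25):
    """Backward shaving (assumed form): repeatedly drop the fraction `frac`
    of variables with the smallest standardized coefficients, returning
    the nested sequence of variable subsets."""
    cols = np.arange(X.shape[1])
    path = [cols]
    while len(cols) > 1:
        _, w, _ = fit_linear_split(X, y, cols)
        importance = np.abs(w) * X[:, cols].std(axis=0)
        n_drop = max(1, int(frac * len(cols)))
        keep = np.sort(np.argsort(importance)[n_drop:])
        cols = cols[keep]
        path.append(cols)
    return path

def cv_select(X, y, n_folds=5, frac=0.25, seed=0):
    """Choose the shaving step with the lowest cross-validated error.
    The subset sizes depend only on the number of columns and `frac`,
    so step s has the same size in every fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    full_path = shaving_path(X, y, frac)
    errs = np.zeros((n_folds, len(full_path)))
    for k, test in enumerate(folds):
        train = np.setdiff1d(np.arange(len(y)), test)
        for s, cols in enumerate(shaving_path(X[train], y[train], frac)):
            mu, w, thr = fit_linear_split(X[train], y[train], cols)
            pred = predict_split(X[test], cols, mu, w, thr)
            errs[k, s] = np.mean(pred != y[test])
    mean_err = errs.mean(axis=0)
    return full_path[int(np.argmin(mean_err))], mean_err

if __name__ == "__main__":
    # Synthetic data (illustrative only): 3 informative and 9 irrelevant variables.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(400, 12))
    y = (X[:, 0] + X[:, 1] + X[:, 2] + 0.3 * rng.normal(size=400) > 0).astype(int)
    selected, cv_err = cv_select(X, y)
    print("selected variables:", selected)
    print("cross-validated error per shaving step:", np.round(cv_err, 3))
```

In this sketch the split is interpretable because only the surviving variables carry nonzero weight, and the cross-validated error curve shows how far the variable set can be shaved before prediction degrades.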