Disclosed is a parallel support vector machine technique for solving problems with a large set of training data, where the kernel computation, as well as the kernel cache and the training data, are spread over a number of distributed machines or processors. A plurality of processing nodes are used to train a support vector machine based on a set of training data. Each of the processing nodes selects a local working set of training data based on data local to that node, for example a local subset of the gradients.
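A minimal sketch of that local selection step, assuming a simple first-order criterion in which each node keeps the q locally held examples with the largest gradient magnitudes (the function name, the candidate count q, and the NumPy representation are illustrative, not from the disclosure):

```python
import numpy as np

def select_local_working_set(local_idx, grad_local, q=2):
    """Pick the q locally stored examples whose gradients have the
    largest magnitude, and return (global index, gradient) pairs that
    the node can transmit to the centralized selection step."""
    order = np.argsort(-np.abs(grad_local))[:q]   # descending by |gradient|
    return list(zip(local_idx[order].tolist(), grad_local[order].tolist()))
```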
Each node transmits selected data related to its working set (e.g., the gradients having maximum values) and receives an identification of a global working set of training data. Each processing node then optimizes the global working set of training data and updates a portion of the gradients of the global working set.
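For a two-element working set, this local optimization reduces to the classic SMO analytic step; the sketch below assumes that convention, uses the dual gradient G_i = 1 - y_i * f(x_i), and omits the box clipping of the alphas to [0, C] for brevity (all names here are illustrative):

```python
def optimize_pair(G_i, G_j, y_i, y_j, K_ii, K_jj, K_ij):
    """Maximize the SVM dual along the feasible direction for the pair
    (i, j), i.e. the direction that keeps sum(y * alpha) unchanged.
    Returns (delta_alpha_i, delta_alpha_j); clipping to [0, C] omitted."""
    eta = K_ii + K_jj - 2.0 * K_ij            # curvature along the direction
    t = (y_i * G_i - y_j * G_j) / max(eta, 1e-12)
    return y_i * t, -y_j * t

def update_local_gradients(grad_local, y_local, y_ws, d_alpha, K_rows):
    """Update this node's slice of the gradient vector after the step.
    K_rows[r] holds kernel values between working-set point r and every
    locally stored point, so each node touches only its own portion."""
    for y_r, da_r, K_r in zip(y_ws, d_alpha, K_rows):
        grad_local -= y_local * (y_r * da_r) * K_r
    return grad_local
```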
The updating of a portion of the gradients may include generating a portion of a kernel matrix.
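The kernel-matrix portion referenced above can be produced on demand as the block of rows pairing the working-set points with a node's local points; the sketch assumes an RBF kernel with a hypothetical width parameter gamma:

```python
import numpy as np

def local_kernel_rows(X_ws, X_local, gamma=0.5):
    """Compute the portion of the kernel matrix a node needs: the block
    K[r, k] = exp(-gamma * ||x_r - x_k||^2) between each working-set
    point x_r and each locally stored training point x_k."""
    sq = (np.sum(X_ws ** 2, axis=1)[:, None]
          - 2.0 * X_ws @ X_local.T
          + np.sum(X_local ** 2, axis=1)[None, :])
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clamp float round-off
```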
These steps are repeated until a convergence condition is met. Each of the local processing nodes may store all, or only a portion, of the training data. While the steps of optimizing the global working set of training data and updating a portion of its gradients are performed in each of the local processing nodes, the generation of the global working set is performed in a centralized fashion, based on the selected data (e.g., the gradients of each local working set) received from the individual processing nodes.
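That centralized step can be as simple as pooling each node's candidates and keeping the overall extremes; a sketch, again with illustrative names, followed by a schematic of the outer loop that repeats until convergence:

```python
def merge_global_working_set(node_candidates, q=2):
    """Centralized selection: pool the (global index, gradient) pairs
    submitted by every node and keep the q largest in magnitude; the
    resulting index list is broadcast back as the global working set."""
    pooled = [pair for cands in node_candidates for pair in cands]
    pooled.sort(key=lambda p: abs(p[1]), reverse=True)
    return [idx for idx, _ in pooled[:q]]

# Outer loop (schematic): iterate until the largest transmitted gradient
# magnitude falls below a tolerance, i.e. the convergence condition.
# while max(abs(g) for _, g in pooled_candidates) > tol:
#     1. every node runs select_local_working_set(...)
#     2. the central step runs merge_global_working_set(...)
#     3. every node runs optimize_pair(...) / update_local_gradients(...)
```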