A method and software for improving the performance of processors by
incorporating an execution unit operable to decode and execute single
instructions, each specifying three registers that each contain a plurality
of data elements. The execution unit is operable to multiply the first and
second registers and add the third register, element by element, to produce
a catenated result containing a plurality of data elements. Additional
instructions provide
group floating-point subtract, add, multiply, set less, and set greater
equal operations. The set less and set greater equal operations produce
either zero or an identity element for each element of a catenated
result. That result facilitates selecting between individual data
elements using bitwise Boolean operations, without requiring
conditional branch operations.
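
The operations summarized above can be modeled in portable C. The following is a minimal sketch under stated assumptions, not the claimed implementation: the type names (group_t, mask_t), the function names, the choice of four 32-bit lanes per register, and the reading of "identity element" as an all-ones bit pattern (the identity for bitwise AND) are all hypothetical and chosen only for illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>

#define LANES 4  /* assumption: four 32-bit elements per register */

static_assert(sizeof(float) == sizeof(uint32_t), "lanes must be 32 bits wide");

/* Hypothetical models of a register of floats and a register of masks. */
typedef struct { float e[LANES]; } group_t;
typedef struct { uint32_t e[LANES]; } mask_t;

/* Group multiply-add: lane i of the result is a[i] * b[i] + c[i]. */
static group_t group_mul_add(group_t a, group_t b, group_t c) {
    group_t r;
    for (int i = 0; i < LANES; i++)
        r.e[i] = a.e[i] * b.e[i] + c.e[i];
    return r;
}

/* Group set less: lane i is all ones if a[i] < b[i], otherwise zero. */
static mask_t group_set_less(group_t a, group_t b) {
    mask_t m;
    for (int i = 0; i < LANES; i++)
        m.e[i] = 0u - (uint32_t)(a.e[i] < b.e[i]);
    return m;
}

/* Group set greater equal: lane i is all ones if a[i] >= b[i], otherwise zero. */
static mask_t group_set_greater_equal(group_t a, group_t b) {
    mask_t m;
    for (int i = 0; i < LANES; i++)
        m.e[i] = 0u - (uint32_t)(a.e[i] >= b.e[i]);
    return m;
}

/* Select per lane with AND, AND-NOT, and OR only: x where the mask is
   all ones, y where it is zero. No branch depends on the data values. */
static group_t group_select(mask_t m, group_t x, group_t y) {
    group_t r;
    for (int i = 0; i < LANES; i++) {
        uint32_t xb, yb, rb;
        memcpy(&xb, &x.e[i], sizeof xb);  /* reinterpret float bits safely */
        memcpy(&yb, &y.e[i], sizeof yb);
        rb = (xb & m.e[i]) | (yb & ~m.e[i]);
        memcpy(&r.e[i], &rb, sizeof rb);
    }
    return r;
}
```

As a usage example, an element-wise minimum of two registers a and b can be written as group_select(group_set_less(a, b), a, b): the mask picks a in every lane where a is smaller and b elsewhere, with no conditional branch on the data.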