A system and software for improving the performance of processors by
incorporating an execution unit configurable to execute a plurality of
instruction streams from a plurality of threads, wherein each
instruction stream includes a group instruction that operates on a
plurality of data elements in partitioned fields of at least one
register to produce a catenated result.
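
As an illustration only, the effect of a group instruction on partitioned fields can be sketched in software. The sketch below assumes a hypothetical group add: a register is modeled as a Python integer, split into equal-width fields, each pair of fields is added independently with wraparound, and the per-field results are catenated back into one register-wide value. The name group_add, the 128-bit default register width, and the choice of addition as the operation are assumptions for illustration, not details taken from the text above.

```python
def group_add(a: int, b: int, field_bits: int, reg_bits: int = 128) -> int:
    """Add two registers field by field and catenate the per-field results.

    Hypothetical model: `a` and `b` are register contents held as
    non-negative integers; `field_bits` is the width of each partitioned field.
    """
    if field_bits <= 0 or reg_bits % field_bits != 0:
        raise ValueError("field width must evenly divide the register width")

    mask = (1 << field_bits) - 1  # selects one field
    result = 0
    for shift in range(0, reg_bits, field_bits):
        x = (a >> shift) & mask   # extract the field from operand a
        y = (b >> shift) & mask   # extract the matching field from operand b
        # Wraparound add inside the field: carries never cross into the
        # neighboring field, which is what makes the fields independent.
        field_sum = (x + y) & mask
        result |= field_sum << shift  # catenate into the result register
    return result


if __name__ == "__main__":
    # Same operands, two different partitionings of a 32-bit register.
    print(hex(group_add(0x00FF0001, 0x00010001, field_bits=16, reg_bits=32)))
    print(hex(group_add(0x00FF0001, 0x00010001, field_bits=8, reg_bits=32)))
```

With 16-bit fields the upper field carries normally within itself (0x00FF + 0x0001 = 0x0100), while with 8-bit fields the same bits wrap to zero (0xFF + 0x01) because the carry is not allowed to cross the field boundary. A hardware execution unit would process all fields at once rather than in a loop; the loop here only models the result.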