A computer system having low memory access latency is disclosed. In one embodiment, the
computer system includes a network and one or more processing nodes
connected via the network, wherein each processing node includes a
plurality of processors and a shared memory connected to each of the
processors. The shared memory includes a cache. Each processor includes a
scalar processing unit, a vector processing unit and means for operating
the scalar processing unit independently of the vector processing unit.
A processor on one processing node can load data directly from, and store
data directly to, the shared memory on another processing node via the network.
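
The arrangement described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the disclosed hardware: the class names (SharedMemory, Network, Processor), the dictionary-based write-through cache, and the addressing scheme are all hypothetical, and the scalar and vector processing units are not modeled. It shows only that a processor on one node can load from and store to the shared memory of another node through the network.

```python
from dataclasses import dataclass, field


@dataclass
class SharedMemory:
    """Per-node shared memory. The dict-based cache is a toy stand-in."""
    words: dict = field(default_factory=dict)
    cache: dict = field(default_factory=dict)

    def load(self, addr):
        if addr not in self.cache:
            # Cache miss: fill from the backing store (unset words read as 0).
            self.cache[addr] = self.words.get(addr, 0)
        return self.cache[addr]

    def store(self, addr, value):
        # Write-through policy, assumed here purely for simplicity.
        self.words[addr] = value
        self.cache[addr] = value


class Network:
    """Routes a remote load or store to the target node's shared memory."""

    def __init__(self):
        self.memories = {}

    def attach(self, node_id, memory):
        self.memories[node_id] = memory

    def load(self, node_id, addr):
        return self.memories[node_id].load(addr)

    def store(self, node_id, addr, value):
        self.memories[node_id].store(addr, value)


@dataclass
class Processor:
    node_id: int
    local_memory: SharedMemory
    network: Network

    def load(self, addr, node_id=None):
        # Local accesses go straight to the node's own shared memory;
        # accesses naming another node go through the network.
        if node_id is None or node_id == self.node_id:
            return self.local_memory.load(addr)
        return self.network.load(node_id, addr)

    def store(self, addr, value, node_id=None):
        if node_id is None or node_id == self.node_id:
            self.local_memory.store(addr, value)
        else:
            self.network.store(node_id, addr, value)


# Two nodes, one processor each (a real node would have a plurality).
net = Network()
mem0, mem1 = SharedMemory(), SharedMemory()
net.attach(0, mem0)
net.attach(1, mem1)
p0 = Processor(0, mem0, net)
p1 = Processor(1, mem1, net)

p0.store(0x10, 42, node_id=1)   # processor on node 0 stores to node 1's memory
print(p1.load(0x10))            # processor on node 1 reads it locally
```

In this model the remote store lands directly in the target node's shared memory, so the processor on that node sees the value with an ordinary local load and no intermediate copy.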