A parallel processing architecture comprises a cluster of embedded
processors that share a common code distribution bus. Pages or blocks of
code are concurrently loaded into respective program memories of some or
all of these processors (typically all processors assigned to a
particular task) over the code distribution bus, and are executed in
parallel by these processors. A task control processor determines when
all of the processors assigned to a particular task have finished
executing the current code page, and then loads a new code page (e.g.,
the next sequential code page within a task) into the program memories of
these processors for execution. The processors within the cluster
preferably share a common memory (one per cluster) that is used to receive
data inputs from, and to provide data outputs to, a higher level
processor. Multiple interconnected clusters may be integrated within a
common integrated circuit device.
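
The page-by-page control flow described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the disclosed hardware: the names `Processor`, `CodeBus`, `run_task`, `page_square`, and `page_add_one` are hypothetical, a code page is modeled as a Python function, the shared memory as a dictionary, and threads stand in for the embedded processors.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Dict, List, Optional

# Assumption: a "code page" is modeled as a function of (processor id, shared memory).
CodePage = Callable[[int, Dict[str, list]], None]


class Processor:
    """Hypothetical model of one embedded processor with its own program memory."""

    def __init__(self, pid: int) -> None:
        self.pid = pid
        self.program_memory: Optional[CodePage] = None
        self.done = True

    def load(self, page: CodePage) -> None:
        # Receive a code page from the bus into local program memory.
        self.program_memory = page
        self.done = False

    def run(self, shared: Dict[str, list]) -> None:
        # Execute the currently loaded page, then flag completion.
        assert self.program_memory is not None
        self.program_memory(self.pid, shared)
        self.done = True


class CodeBus:
    """Hypothetical model of the code distribution bus: one page reaches every listener."""

    def broadcast(self, page: CodePage, processors: List[Processor]) -> None:
        for p in processors:
            p.load(page)


def run_task(pages: List[CodePage], processors: List[Processor],
             shared: Dict[str, list]) -> int:
    """Task-control loop: load a page, wait until all processors finish, load the next."""
    bus = CodeBus()
    pages_run = 0
    with ThreadPoolExecutor(max_workers=len(processors)) as pool:
        for page in pages:
            bus.broadcast(page, processors)
            futures = [pool.submit(p.run, shared) for p in processors]
            wait(futures)              # block until every processor has finished
            for f in futures:
                f.result()             # surface any exception raised by a page
            assert all(p.done for p in processors)
            pages_run += 1
    return pages_run


# Two illustrative code pages; each processor works on its own slot of shared memory.
def page_square(pid: int, shared: Dict[str, list]) -> None:
    shared["partial"][pid] = shared["inputs"][pid] ** 2


def page_add_one(pid: int, shared: Dict[str, list]) -> None:
    shared["outputs"][pid] = shared["partial"][pid] + 1


cluster = [Processor(i) for i in range(4)]
shared_memory = {"inputs": [1, 2, 3, 4], "partial": [0] * 4, "outputs": [0] * 4}
pages_executed = run_task([page_square, page_add_one], cluster, shared_memory)
print(shared_memory["outputs"])  # [2, 5, 10, 17]
```

In this model the second page is not loaded until every processor has reported completion of the first, which mirrors the role attributed to the task control processor. The `inputs` and `outputs` entries play the part of the shared memory through which a higher level processor would supply data and collect results.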