A method and processor architecture are provided that enable efficient pre-fetching
of instructions for multithreaded program execution. The processor architecture
comprises an instruction pre-fetch unit, which includes a pre-fetch request engine,
a pre-fetch request buffer, and additional logic components. A number of pre-defined
triggers initiate the generation of a pre-fetch request that includes an identification
(ID) of the particular thread from which the request is generated. Two counters
are utilized to track the number of threads and the number of executed instructions
within the threads, respectively. The pre-fetch request is issued to the lower
level cache or memory and returns with a corresponding cache line, tagged with
the thread ID. The cache line is stored in the pre-fetch request buffer along with
its thread ID. When the particular thread later requires the instruction, the instruction
is provided from within the pre-fetch request buffer at a shorter access latency
than from the lower level cache or memory.
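
The buffer behavior described above can be illustrated with a minimal software sketch. It is not the disclosed hardware: the class name `PrefetchBuffer`, the line size, the FIFO eviction policy, and the choice to look entries up by the pair (thread ID, line address) are all assumptions made for illustration, and the lower level cache or memory is modeled as a plain dictionary.

```python
from collections import OrderedDict

LINE_SIZE = 4  # instructions per cache line (assumed for illustration)


class PrefetchBuffer:
    """Hypothetical model of a pre-fetch request buffer with thread-ID tags."""

    def __init__(self, capacity, lower_level):
        self.capacity = capacity        # maximum number of buffered cache lines
        self.lower_level = lower_level  # line address -> list of instructions
        self.entries = OrderedDict()    # (thread_id, line_addr) -> cache line

    def prefetch(self, thread_id, addr):
        """Issue a request tagged with thread_id and buffer the returned line."""
        line_addr = addr - addr % LINE_SIZE
        key = (thread_id, line_addr)
        if key in self.entries:
            return  # line already buffered for this thread
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry (assumed policy)
        self.entries[key] = self.lower_level[line_addr]

    def fetch(self, thread_id, addr):
        """Return (instruction, source), where source names where it was found."""
        line_addr = addr - addr % LINE_SIZE
        line = self.entries.get((thread_id, line_addr))
        if line is not None:
            return line[addr - line_addr], "buffer"
        # Miss: fall back to the slower lower level cache or memory.
        return self.lower_level[line_addr][addr - line_addr], "lower_level"


# Toy instruction memory: four cache lines of four instructions each.
memory = {
    base: [f"insn@{base + i}" for i in range(LINE_SIZE)]
    for base in range(0, 16, LINE_SIZE)
}
buf = PrefetchBuffer(capacity=2, lower_level=memory)
buf.prefetch(thread_id=1, addr=5)  # a trigger fires for thread 1
```

In this sketch, a later `buf.fetch(1, 6)` is served from the buffer, because address 6 falls in the line pre-fetched for thread 1, while `buf.fetch(2, 6)` falls through to the lower level because no line is buffered under thread 2's ID.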