A method and system for reducing or avoiding store misses with a data
cache block zero (DCBZ) instruction in cooperation with the underlying
hardware load stream prefetching support to help increase effective
aggregate bandwidth. The method identifies and classifies unique
streams in a loop based on dependency and reuse analysis, and performs
loop transformations, such as node splitting, loop distribution or stream
unrolling to get the proper number of streams. Static prediction and
run-time profile information are used to guide loop and stream selection.
Compile-time loop cost analysis, together with run-time check code and
versioning, is used to determine how many cache lines ahead of each
reference data cache lines are zeroed and to accommodate the data
alignment required relative to cache line boundaries.
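
The following is a minimal sketch in portable C of the kind of loop such a transformation might produce for a single store stream; it is not the claimed implementation. Everything named here is an assumption for illustration: the 128-byte line size, the helper zero_cache_line (a memset standing in for a real DCBZ), the function copy_with_line_zeroing, and the lines_ahead parameter. The peeled prologue and epilogue play the role of the run-time alignment handling, and the restrict qualifiers stand in for the dependency analysis that must show the store stream does not overlap any load stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 128-byte data cache lines. */
#define CACHE_LINE_BYTES 128u

/* Hypothetical stand-in for DCBZ: zero one whole, line-aligned block. */
static void zero_cache_line(void *line)
{
    assert(((uintptr_t)line % CACHE_LINE_BYTES) == 0);
    memset(line, 0, CACHE_LINE_BYTES);
}

/*
 * Copy src[0..n) to dst[0..n), zeroing destination cache lines
 * `lines_ahead` lines before they are stored to. dst and src must not
 * overlap. Returns the number of lines zeroed ahead of the stores.
 */
size_t copy_with_line_zeroing(double *restrict dst,
                              const double *restrict src,
                              size_t n, size_t lines_ahead)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(double);
    size_t zeroed = 0;
    size_t i = 0;

    /* A distance beyond the array can never be used; clamp it so the
       index arithmetic below stays small. */
    if (lines_ahead > n / per_line)
        lines_ahead = n / per_line;

    /* Prologue: plain stores until dst + i sits on a line boundary.
       If dst never reaches one, this loop copies everything. */
    while (i < n && ((uintptr_t)(dst + i) % CACHE_LINE_BYTES) != 0) {
        dst[i] = src[i];
        i++;
    }

    /* Main loop: one full destination cache line per iteration. */
    for (; i + per_line <= n; i += per_line) {
        size_t ahead = i + lines_ahead * per_line;

        /* Zero only lines that lie entirely inside dst[0..n), so every
           zeroed element is overwritten by a later iteration. */
        if (ahead + per_line <= n) {
            zero_cache_line(dst + ahead);
            zeroed++;
        }
        for (size_t j = 0; j < per_line; j++)
            dst[i + j] = src[i + j];
    }

    /* Epilogue: the trailing partial line is stored without zeroing. */
    for (; i < n; i++)
        dst[i] = src[i];

    return zeroed;
}
```

A call such as copy_with_line_zeroing(dst, src, n, 2) zeroes each full destination line two lines before the stores reach it. On real hardware the value of lines_ahead would come from the loop cost analysis rather than being fixed by the caller.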