A schedule can be generated for physically transposing an array such that
when the array is transferred from a first memory type to a second memory
type, the number of block transfers performed is minimized. The array can
be rearranged to ensure that most or all data elements in any block read
into internal memory are used before that block's internal storage is
reused. The algorithm can compute an offline schedule and then execute
that schedule. The method can assemble the schedule during one or more
passes with an algorithm. Scheduling passes can apply a permutation to a
representation of the array's structure and then tile the representation
to ensure efficient use of internal memory. Tiling may alter the
permutation, so the algorithm can be reinvoked and run on the tiled
representation. The algorithm can be run on successively retiled
representations until no additional tiling is required. Optimizations for
schedule generation and/or execution include: simplification of the data
structure representing the array; alteration of the order in which tiles
are accessed; inversion of permutations; processing non-canonical input;
and the production of stream output. The method can also be modified for
specialized architectures, including: block I/O architecture; fully
associative caches; set-associative caches; and multi-level memory
hierarchies.