Dec 12 2008
Cache coherency explained
Cache coherency explained The Intel SA-110 StrongARM processor is a Harvard cache architecture processor – hence it uses separate instruction and data caches, and in addition its data cache is a write-back cache. This raises cache coherency problems in a multiprocessing configuration. Caching explained Most modern microprocessors contain some form of instruction and data caching where a small but fast bit of memory is used to speed up access to main memory. At its simplest level, a cache can pre-load blocks of memory from main memory so that the processor need not stall when performing a load. This is possible because (a) a processor usually accesses sequential memory locations and (b) loading more than one sequential memory location at the same time is faster than loading each sequential memory location one by one.Hence when the processor accesses an uncached bit of memory, the cache reads a full cache line in the hope it will be used (which it usually is).Again at its simplest level, a cache must be write-through cache whenever the processor performs a write to main memory. Quite simply, when the write is performed, the relevant cache entry is updated, and a write to main memory is issued. One could say that cache coherency has been maintained ie; the cache accurately reflects the contents of main memory.The problem is, writes to main memory are usually even slower than reads. Things can be made better through the use of write buffering where writes are collated together for burst writing (much as reads are), but especially given that main memory cannot be loaded from whilst being stored to, one can see how the processor spends a lot of time stalled waiting on main memory. The solution to this is to minimise the number of writes actually made to main memory by only doing the write once the processor has definitely finished with that location. This is what a write-back cache does – it writes out the location to main memory sometime after the actual write took place. Write-back Caching explained A write-back cache is for obvious reasons somewhat more complicated than that for a write-through cache. One must now maintain a cleanliness flag for each cache entry where a clean entry accurately reflects the contents of main memory and where a dirty entry does not. Now when the cache loads a cache line, its contents are marked as clean. When the processor changes a memory location, the change is made within the cache and that cache entry marked as dirty. At some stage later, depending on when the caching algorithm decides, clean cache entries are thrown away and dirty cache entries are written to main memory. However, here it is important to note that for the period while there are dirty cache entries for a memory location, the cache is not coherent with that memory location. Cache incoherency Now imagine if there were two processors of this type accessing a particular area of memory at the same time. The first processor would be holding most of that area within its data cache, including any changes it might have made to the data. The second processor then accesses that area of memory where it will find the original data not yet written back to, and loads that old data into its cache. Now, depending on the caching algorithm, the first processor may write back the dirty cache entries, and of course the second processor doesn’t know any better and so it may write back its changes to the data later – obviously, this causes memory corruption.Traditionally, special support hardware is used to maintain cache coherency – an update to a memory location by one processor invalidates that location in all other processor’s caches. However, no such hardware support exists in our target hardware – hence, a cache coherency solution must be implemented in software. Software based Cache Coherency 1. Mutually exclude use of an area with that area being accessed remotely So long as an area of memory is guaranteed to be exclusively available to a processor, it can enable the use of its data cache when accessing that area of memory. Using semaphores, areas of memory can be marked “in use” by a particular processor and hence exclusively allocated to that processor. 2. Prevent areas being accessed by multiple processors from being cached The StrongARM’s Memory Management Unit (MMU) allows the caching used to access memory to be specified to a 1Mb, 64k & 4k unit granularity. Hence, one can mark all shared regions of memory as uncacheable but still write-bufferable. This prevents the processor using its data cache on shared regions of memory but at least writes to shared areas permit burst writes. 3. Implement software cache coherency One severe problem with doing large volumes of work in the type of uncached shared memory above is that burst reads are not permitted. If you are copying memory from one location to another (a common enough activity), the writes would be fast enough, but the reads would be extremely slow. A solution here is to implement a form of software triggered cache coherency.For example, if processor A wanted to rewrite 64k of processor B’s memory, it could send a message to processor B to say it was about to do so. Processor B would then invalidate that area of memory within its cache and Processor A would write the 64k. Should Processor A or B then access that 64k, both would have the correct contents.The reason why as large an amount as 64k is necessary is because doing this requires a certain amount of overhead which is proportionately less as the amount increases. Also, of course, one would have to ensure that Processor B is happy to lose everything within that 64k. However, if these stringent conditions are met, the data caches for both processors can be left enabled with the resultant speed increase. Conclusions Through the use of the recommendations within this document, it is hoped that the highest possible performance can be achieved in a multiprocessing StrongARM solution. Short Note: When a datum is written to the cache, it must at some point be written to the backing store as well. The timing of this write is controlled by what is known as the write policy.In a write-through cache, every write to the cache causes a synchronous write to the backing store.Alternatively, in a write-back (orwrite-behind) cache, writes are not immediately mirrored to the store. Instead, the cache tracks which of its locations have been written over (these locations are marked dirty). The data in these locations is written back to the backing store when those data are evicted from the cache. For this reason, a miss in a write-back cache (which requires a block to be replaced by another) will often require two memory accesses to service: one to retrieve the needed datum, and one to write replaced data from the cache to the store.Data write-back may be triggered by other policies as well. The client may make many changes to a datum in the cache, and then explicitly notify the cache to write back the datum.No-write allocation is a cache policy where only processor reads are cached, thus avoiding the need for write-back or write-through when the old value of the datum was absent from the cache prior to the write.The data in the backing store may be changed by entities other than the cache, in which case the copy in the cache may become out-of-date or stale. Alternatively, when the client updates the data in the cache, copies of that data in other caches will become stale.