In 2010 Acumem was acquired by Rogue Wave Software, and SlowSpotter was consolidated into ThreadSpotter.
Generally speaking, memory bus bandwidth has not seen the same improvement as CPU performance (an observation sometimes referred to as the memory wall), and with multi-core and many-core systems, the available bandwidth is shared between all cores. This makes preservation of memory bandwidth one of the most important tasks in achieving top performance.
The tool finds suboptimal memory access patterns and applies heuristics to categorize them, explain their root causes, and suggest ways to improve the code.
It also has predictive capabilities, as one captured performance fingerprint can be used to predict performance characteristics on both existing and hypothetical architectures. The what-if analysis also allows exploring the optimal ways to bind threads to cores, all from a single sampling run.
Spatial locality refers to the desirable property of accessing memory locations that are close to each other. Poor spatial locality is penalized in a number of ways:
- Accessing data very sparsely will have the negative side effect of transferring unused data over the memory bus (since data travels in chunks). This raises the memory bandwidth requirements, and in practice imposes a limit on the application performance and scalability.
- Unused data will occupy the caches, reducing the effective cache size. This causes more frequent evictions and more round trips to memory.
- Unused data will reduce the likelihood of encountering more than one useful piece of data in a mapped cache line.
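As a concrete illustration (a generic C sketch of ours, not ThreadSpotter output), summing a single field of a large structure drags the unused fields across the memory bus as well, because data always travels in full cache lines:

```c
#include <stddef.h>

struct particle_aos {        /* array-of-structures: 32 bytes per element */
    double x, y, z, mass;
};

/* Only 8 of every 32 bytes fetched are used: roughly 25% fetch
   utilization, so four times the necessary bus traffic. */
double sum_mass_aos(const struct particle_aos *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += p[i].mass;
    return s;
}

/* Structure-of-arrays: the mass values are contiguous, so every byte of
   every fetched cache line is useful. */
double sum_mass_soa(const double *mass, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += mass[i];
    return s;
}
```

Both functions compute the same result; they differ only in how many cache lines they pull through the memory system.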
ThreadSpotter identifies the following opportunities for improving spatial locality:
- Poor Fetch Utilization
- Poor Write-back Utilization
- Incorrectly nested loops
- Spatial blocking (also known as loop tiling)
- Spatial loop fusion
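Of these, incorrectly nested loops are perhaps the most common case. A minimal sketch (our example, assuming a row-major 512x512 C array):

```c
#define N 512

/* Incorrectly nested: consecutive iterations touch addresses that are
   N * sizeof(double) apart, so each fetched cache line yields a single
   useful element before it is evicted. */
double sum_columns_first(double a[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += a[i][j];
    return s;
}

/* Interchanged loops walk memory sequentially, consuming entire cache
   lines and letting the hardware prefetcher help. */
double sum_rows_first(double a[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += a[i][j];
    return s;
}
```

The two functions return the same value; only the order in which memory is touched differs.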
Temporal locality relates to reuse of data. Reusing data while it is still in the cache avoids sustaining memory fetch stalls and generally reduces the memory bus load. ThreadSpotter classifies these optimization opportunities:
- Temporal blocking
- Temporal loop fusion
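Temporal loop fusion can be sketched as follows (a generic example of ours): two separate passes each pull the whole array through the cache, while the fused version reuses each element immediately:

```c
#include <stddef.h>

/* Two passes: for a large array, by the time the second loop starts the
   first elements have long been evicted, so every element is fetched
   from memory twice. */
void scale_then_offset(double *a, size_t n, double k, double c) {
    for (size_t i = 0; i < n; ++i) a[i] *= k;
    for (size_t i = 0; i < n; ++i) a[i] += c;
}

/* Fused: each element is reused while still in a register or the L1
   cache, roughly halving the memory traffic. */
void scale_and_offset_fused(double *a, size_t n, double k, double c) {
    for (size_t i = 0; i < n; ++i) a[i] = a[i] * k + c;
}
```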
Latency hiding
Caches hide memory latency by serving a limited amount of data from a small but fast memory. This is effective if the data set can be made to fit in the cache.
A complementary technique is prefetching, where data transfer is initiated, explicitly or automatically, ahead of when the data is needed. Ideally, the data has reached the cache by the time it is used.
ThreadSpotter identifies problems in this area:
- Prefetch instruction too close to the subsequent data use
- Prefetch instruction too far from the use (data may be evicted before use)
- Unnecessary prefetch instruction (since data is already in the cache)
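Explicit prefetching can be sketched with the GCC/Clang `__builtin_prefetch` builtin. The prefetch distance below is an arbitrary assumption; it must be tuned so that lines arrive neither too late (still in flight when used) nor so early that they are evicted before use:

```c
#include <stddef.h>

/* Workload-dependent assumption: prefetch 16 elements ahead. */
#define PF_DIST 16

double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PF_DIST < n)
            /* args: address, 0 = read, 3 = high temporal locality */
            __builtin_prefetch(&a[i + PF_DIST], 0, 3);
        s += a[i];
    }
    return s;
}
```

On a strided or irregular access pattern that defeats the hardware prefetcher, this kind of software prefetch can recover some of the lost latency hiding; on a simple sequential scan like this one it is usually unnecessary.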
Furthermore, the tool judges how well the hardware prefetcher is working by modelling its behavior and recording cases of:
- Irregular memory access pattern
Avoiding cache pollution
If a data set has a footprint larger than the available cache, and there is no practical way to reorganize the access patterns to improve reuse, there is no benefit from storing that data in the cache in the first place. Some CPUs have special instructions for bypassing caches for exactly this purpose.
ThreadSpotter finds opportunities to:
- Replace write instructions with non-temporal write instructions.
- Insert non-temporal prefetch hints to avoid mapping the data in higher level caches.
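A sketch of the non-temporal write case, using the x86 SSE2 streaming-store intrinsic where available (the fallback to a plain store on other targets is ours):

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Fill a large buffer that will not be read back soon. Non-temporal
   stores (MOVNTI, via _mm_stream_si32) write past the caches, so the
   output does not evict the working set. */
void fill_nontemporal(int32_t *dst, size_t n, int32_t value) {
    for (size_t i = 0; i < n; ++i) {
#if defined(__SSE2__)
        _mm_stream_si32((int *)&dst[i], value);
#else
        dst[i] = value;  /* plain store on non-SSE2 targets */
#endif
    }
#if defined(__SSE2__)
    _mm_sfence();  /* make the streaming stores globally visible */
#endif
}
```

The trade-off is that a subsequent read of `dst` must go all the way to memory, which is why this only pays off when the data is genuinely not reused soon.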
Coherency traffic effectiveness
When there are several caches in a system, they must be kept consistent with each other, and the activities that manage this consistency take time to carry out. Just as it is important to observe locality in how memory locations are accessed, it is important to pay attention to, and limit, this implicit coherence traffic.
For instance, in a producer/consumer scenario where threads use a piece of shared memory to transfer data between themselves, ownership of that memory will repeatedly shift between the producer's and the consumer's caches. If the producer revisits a cache line before all of the data in it has been consumed, this is a sign of a poor communication pattern; it would be better to fill up the entire cache line before handing it over to the consumer.
ThreadSpotter classifies this as:
- Poor communication utilization
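The better pattern, filling a whole cache line before handing it over, can be sketched with C11 atomics (the 64-byte line size and the struct layout are illustrative assumptions):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define SLOT_ITEMS 15

/* One slot occupies a single 64-byte cache line on typical x86 parts:
   60 bytes of payload plus a 4-byte publication flag. */
struct slot {
    int32_t item[SLOT_ITEMS];
    atomic_int ready;
};

/* Producer fills the entire line, then publishes it once, so ownership
   moves to the consumer's cache once per SLOT_ITEMS items rather than
   once per item. */
void produce(struct slot *s, const int32_t *src) {
    memcpy(s->item, src, sizeof s->item);
    atomic_store_explicit(&s->ready, 1, memory_order_release);
}

/* Returns 1 and copies the payload if a full slot has been published. */
int consume(struct slot *s, int32_t *dst) {
    if (!atomic_load_explicit(&s->ready, memory_order_acquire))
        return 0;
    memcpy(dst, s->item, sizeof s->item);
    return 1;
}
```

The release/acquire pair ensures the consumer never observes the flag without also observing the completed payload.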
If two threads each use their own variable, but the variables are laid out in the same cache line, ownership of the cache line will likewise bounce back and forth between the two threads. This condition can usually be fixed by separating the variables so that they reside in different cache lines.
ThreadSpotter identifies this as a case of:
- False sharing
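The separation of per-thread variables onto distinct cache lines can be sketched in C11 with over-alignment (64 bytes is a typical x86 line size; portable code should query the actual size):

```c
#include <stdalign.h>

/* alignas(64) pads each element out to a full cache line, so one
   thread's updates no longer invalidate the line holding the other
   thread's variable. */
struct padded_counter {
    alignas(64) long value;
};

static struct padded_counter counters[2];  /* one element per thread */

void add_to_counter(int thread_id, long n) {
    counters[thread_id].value += n;
}
```

The cost is wasted space (64 bytes per counter instead of 8), which is why this fix is reserved for variables that are actually contended.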
Cache sharing effects
In CPUs with multiple levels of cache, the cache topology is sometimes asymmetric, in the sense that the communication cost between two caches on the same level is non-uniform.
ThreadSpotter allows experimenting with different thread/core bindings to compare the combined effect of sharing and coherence. If threads share a cache, the effective cache size per thread is smaller; but if those threads also share data, co-locating them on one cache can be faster, because it avoids the coherence traffic that separate caches would otherwise incur.
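On Linux, such a binding experiment can be sketched with the GNU `pthread_setaffinity_np` extension (which core numbers actually share a cache is machine-specific and must be read from the platform topology, e.g. under /sys/devices/system/cpu/):

```c
#define _GNU_SOURCE  /* must precede the includes for the np extension */
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one core; returns 0 on success. Binding two
   communicating threads to cores that share a cache level is one of the
   placements worth comparing. */
int bind_current_thread_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}
```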
This article uses material from the Wikipedia article Acumem SlowSpotter, which was deleted or is being discussed for deletion, and which is released under the Creative Commons Attribution-ShareAlike 3.0 Unported License.