–> we use a x86 processor for this article
sorry apple M1/M2((( –> use qemu
**donot compile the code with optimizations flags
Speculative execution (can skip for now)
It is basic enhacement inside modern CPUs to improve performance by predicting future outcomes of branches inside your code –> execute them ahead of time –> faster computation
example
imagine such a simple scenario with code with a conditional branch such as if-else statement. normally CPU could follow a sequential approach –> first wait while evaluating the condition –> then follow the appropriate path
but with speculative execution –> both evaluation of condition + execution of path takes place simultaneously –> then he will continue if the predicted path is the correct path after evaluation finishes –> if not he will discard the results and computes other branches
the idea is to keep CPU himseslf busy if resources are available –> maximimum throughput + minimum idle time
well such idea was discuss for long time before it became popular –> he first became popular inside of the legendary processors like Intel Pentium Pro with aggresive speculative execution techniques where he gives signifcant performance improvements –> now the techniques inside speculative execution became more advanced complex like –> out of order, branch prediction, speculative memory access, value prediction, thread level speculation +++ (no understanding of this needed now –> continue)))
Spectre Attack (can skip for now)
such a cool breakthrough from the legendary Project Zero inside 2018.
Spectre takes advantage fundamental idea of speculative execution –> ability to predict branches accurately most of the time, but not always
this is called branch misprediction –> if the processor executes the path of a wrongly predicted branch –> he must discard the incorrect speculative results + revert system state to the point behind mispredicted branch
to understand the essence of Spectre –> first remember that speculative execution himself leaves light traces of its actions inside CPU cache hierarchy (multiple levels like L1, L2 sometimes L3 each with different capacties + different access speeds)–> while the processor attempts to roll back any speculative operations on discovering a misprediction –> it is not able to fully undo his effects inside the cache))
this remaining cache information might include information about data evictions, updates, accesses ++ that happen inside the speculative phase
so from attacker side–> this is good for us –> with a cunning Spectre attack exploit (+ nice observation + timing )–> this remaining cache information can be used to extract sensitive data inside a victim process))
how to do this exactly?
we measure the time to access specific memory location –> which varies depend on whether associated cache lines were manipulated during speculative execution
Optimization
Inside this article we will write the optimization part of the Spectre attack (no need to understand Spectre now –> continue)))
“but why study such garbage man? i am not a programmer, i am a pentester sitting inside XSS commercial forum for a locker payment of 5kk$ notes from the cans of white house?”
–> well fuck himself
there are two important features of optimization we discuss
–> reduce noise
noise reduction inside cache access time measurements is important for a Spectre exploit –> increase accuracy of timing side-channel used to extract sensitive information –> more chance of successful attack (if you do not understand how noise affects Spectre –> do not worry –> this is not important for now, not the scope inside article
–> first you can learn why noise arises –> and how to reduce him?)
any interference + disturbance (example such bastards come from other processes, interrupts, context switches, even variations inside system performance himself) will fuck the result of accurate measurement of the timing window
–> balanced timing window
timing windows –> specific time period during which attacker can observe the side effects of speculative execution which includes obtaining information about CPU cache state + other microarchitectural components. small timing window –> increasing precision of attack which also reduce potential for more noise
with such small timing window –> we can accurately meaure the timing difference between cache hits and misses
–> reduce noise
with big timing window –> more time for extracting info
so which one to decide?
this is controversy debate –> but there is not a single answer himself
–> you have to profle the target system to understand cache behavior + spec execution patterns + timing variatitons
–> experiment with different window sizes + optimize
–> adaptive timing window
…..
but we will not discuss such inside this article –> so we focus more on the noise himself (final important factor than timing window)
Cache Lines
Modern processors normally use a cache based memory hierarchy for improve the memory access performance. the cache is divided into fixed-size units called cache lines –> each line contains multiple memory addresses
by access memory in a cache friendly manner –> where memory addresses are likely to be accessed together are stored inside same cache line –> the processor can minimize the number of cache misses + improve performance

Prime+Probe Technique
The technique is the most basic technique for noise reduction by carefully controlling cache state + minimizing interference from other bastards
the basic idea –> prime the cache by loading specific data including specific victim process to evict some of its own data–> after certain delay we can probe the cache to measure time takes to access certain memory locations –> we then analyze the timing for victim’s memory access patterns and extract sensitive info
NOTE –> we use the x86intrin header to access Intel intrinsics for x86 processors to work with special CPU instructions
–> we define a function prime_probe which takes array pointer –> prime probing technique (flush cache lines + access them –> measure access time)
mm_clflush instruction to flush each cache line in given range –> this will ensure that cache lines are empty –> need to be fetch from inside main memory when accessed
we access cache inside a nice cache friendly manner with 64 byte cache lines
** use volatile variable for sum so the compiler does not fuck him inside the ass
#define CACHE_LINE_SIZE 64
void prime_probe(uint8_t *ptr) {
volatile uint64_t sum = 0; // store sum of accessed cache lines
// flush cache lines
for (int i = 0; i < 256; i++) {
_mm_clflush(ptr + i * CACHE_LINE_SIZE);
}
// access flushed cache lines --> measure access time
for (int i = 0; i < 256; i++) {
sum += *(volatile uint64_t *)(ptr + i * CACHE_LINE_SIZE);
// ***access cache lines using a volatile pointer --> ensure actual memory accesses
}
}
–> now the code is finish –> lets measure performance of access array with and without prime probing
uint8_t *array = (uint8_t *)malloc(ARRAY_SIZE * CACHE_LINE_SIZE);
but why such ARRAY_SIZE * CACHE_LINE_SIZE ? –> the goal of such allocation for such specific size to align with cache line size –> such alignment helps improve memory access performance due to cache behavior because it reduces cache misses and improves memory access patterns –> better utilization of CPU cache and even faster execution times
#define ARRAY_SIZE (1 << 20)
#define ITERATIONS 1000
int main() {
uint8_t *array = (uint8_t *)malloc(ARRAY_SIZE * CACHE_LINE_SIZE);
uint64_t start, end;
// first measurement --> without prime probe
start = __rdtsc();
for (int i = 0; i < ITERATIONS; i++) {
volatile uint64_t sum = 0;
for (int j = 0; j < ARRAY_SIZE; j++) {
sum += array[j * CACHE_LINE_SIZE];
}
}
end = __rdtsc();
printf("without prime probe --> %lu cycles/iteration\n", (end - start) / ITERATIONS);
// second measurement --> with prime probe
start = __rdtsc();
for (int i = 0; i < ITERATIONS; i++) {
prime_probe(array);
volatile uint64_t sum = 0;
for (int j = 0; j < ARRAY_SIZE; j++) {
sum += array[j * CACHE_LINE_SIZE];
}
}
end = __rdtsc();
printf("with prime probe --> %lu cycles/iteration\n", (end - start) / ITERATIONS);
free(array);
return 0;
}
Results
without prime probe --> 3812776 cycles/iteration
with prime probe --> 3518063 cycles/iteration
without prime probe --> 3742036 cycles/iteration
with prime probe --> 3662542 cycles/iteration
“aah such garbage man results from bastard drpalpatine –> i will throw 2 dislikes if i can”
–> we analyze and try improve
Can we improve?
From simple static analysis of code –> we find some bastards
1. increase ITERATIONS + ARRAY_SIZE
–> we will now get more accuracy by reducing outliers inside measurement
–> large array size increases chance of interference between memory accesses –> which is exactly what we want to measure with Prime+Probe
–> avoid issue with cache thrashing –> also we learn a new concept –> cache thrashing (when cache is frequently filled with data that is quickly evicte/replace by new data –> results in high cache misses) such concept itself is because of many reasons and there are nice mitigations –> outside of scope of this article
but such only improvement will not solve our problems because the problem is not significant as such numbers are already big and no practical difference inside results
2. instead of flush all cache lines inside array –> why not target specific cache lines which will likely interfere with the subsequent memory accesses
3. instead of access array elements sequential order –> we use different patterns (such as random access or even strided access)
why? –> better simulate real world memory access patterns –> he will help increase relevance of measurements + reduce impact of any specific optimization/interference inside our system
4. we will use more powerful timing technique –> RDTSC is a simple function –> he does not take effects of CPU frequency scaling, cache effects other bastards who will eat our accuracy
–> we will use the rdtscp instruction instead of rdtsc –> he will return a CPU ID of executing core
–> especially inside multiple cores or other complex systems
now enough theory –> apply
–> first we will modify the prime_probe() –> we will only target specific cache lines instead of all cache lines inside our array
void prime_probe(uint8_t *ptr, int start, int end) {
volatile uint64_t sum = 0; // store sum of accessed cache lines
// flush --> specific cache lines
for (int i = start; i < end; i++) {
_mm_clflush(ptr + i * CACHE_LINE_SIZE);
}
// access flushed cache lines --> measure access time
for (int i = start; i < end; i++) {
sum += *(volatile uint64_t *)(ptr + i * CACHE_LINE_SIZE);
// ***access cache lines using a volatile pointer --> ensure actual memory accesses
}
}
–> now we add random selection of cache set to Prime+Probe in each iteration + random access pattern instead of sequential
#define PRIME_PROBE_ROUNDS 256 // number of cache lines targeted
int main() {
uint8_t *array = (uint8_t *)malloc(ARRAY_SIZE * CACHE_LINE_SIZE);
uint64_t start, end;
unsigned int garbage; // he will be unused --> because the compiler will optimize away the rdtscp instruction
// first measurement --> without prime probe
start = __rdtscp(&garbage);
for (int i = 0; i < ITERATIONS; i++) {
volatile uint64_t sum = 0;
for (int j = 0; j < ARRAY_SIZE; j++) {
sum += array[j * CACHE_LINE_SIZE];
}
}
end = __rdtscp(&garbage);
printf("without Prime+Probe --> %lu cycles/iteration\n", (end - start) / ITERATIONS);
// second measurement --> with prime probe
srand(time(NULL));
start = __rdtscp(&garbage);
for (int i = 0; i < ITERATIONS; i++) {
// random select a cache set --> prime and probe
int set_index = rand() % (ARRAY_SIZE / PRIME_PROBE_ROUNDS);
int prime_start = set_index * PRIME_PROBE_ROUNDS;
int prime_end = prime_start + PRIME_PROBE_ROUNDS;
prime_probe(array, prime_start, prime_end);
// access all cache lines --> inside selected cache set
volatile uint64_t sum = 0;
for (int j = prime_start; j < prime_end; j++) {
sum += array[j * CACHE_LINE_SIZE];
}
}
end = __rdtscp(&garbage);
printf("with Prime+Probe --> %lu cycles/iteration\n", (end - start) / ITERATIONS);
free(array);
return 0;
}
Results
without Prime+Probe --> 58635843 cycles/iteration
with Prime+Probe --> 47416 cycles/iteration
without Prime+Probe --> 15100261 cycles/iteration
with Prime+Probe --> 13211 cycles/iteration
nice))) such a cool improvement
Conclusion for Prime+Probe
“again drpalpatine you said –> we will decrease noise so it will be useful inside Spectre attacks but why did bastard decrease time cycles?”
–> here CPU cycles is like a proxy for noise inside cache access time measurements –> because caches access times are affected by noise simply
“does that mean noise is reduced by 99%? cool”
–> not really –> but we can assume significant noise reduction happened –> but we know that time cycles proportional to noise roughly
to accurately quantify such noise reduction –> we need to use more advanced statistical analysis/simulation approach etc so we can isolate impact of prime+probe
again remember why such study is important –>
noise reduction inside cache access time measurements is important for a Spectre exploit –> increase accuracy of timing side-channel used to extract sensitive information –> more chance of successful attack.”
however such 97% improvement should be taken with salt grains –> it depends on specific hardware etc
Problems with Prime+Probe
cache contention –> when multiple processes/threads access same cache lines –> he will implement accuracy
cache replacement policy –> Prime+Probe depends inside cache replacement policy to evict target data from cache –> but different cache replacement policies have different evict patterns –> prediction when target data will be evicted become difficult
limited cache sets –> Prime+Probe is most effective when target data is inside a big number of cache sets –> but inside some systems –> number of cache sets is limited –> it will become difficult to isolate target data
etc
Flush+Reload
it is technique to measure time take to access a memory address –> determine if its inside cache or not –> we can do this flushing cache line containing containing memory address –> measure time take to reload cache line –> if time taken to reload cache line is shorter than threshold –> memory inside cache else not))
–> we define flush_cache_line function to flush a cache line from the CPU cache hierarchy using __mm_clflush function –> he is an intrinsic function that invokes CLFLUSH instruction –> force the CPU to write back + invalidate specified cache line
void flush_cache_line(volatile uint8_t* ptr) {
_mm_clflush((const void*)ptr);
}
–> we define reload_cache_line function to measure access time to a cache line
uint64_t reload_cache_line(volatile uint8_t* ptr) {
uint64_t start, end;
unsigned int aux;
start = __rdtscp(&aux);
volatile uint8_t value = *ptr;
end = __rdtscp(&aux);
return (end - start);
}
–> we will perform Flush+Reload technique inside the flush_reload function on cache line of array
first calculate pointer to cache line by multiplying index by CACHE_LINE_SIZE –> call flush_cache_line to flush cache line from CPU cache hierarchy –> call reload_cache_line to measure access time to cache line –> if access time less than THRESHOLD value –> print a message indicating a cache hit
void flush_reload(uint8_t* array, int index) {
volatile uint8_t* ptr = &array[index * CACHE_LINE_SIZE];
flush_cache_line(ptr);
uint64_t access_time = reload_cache_line(ptr);
if (access_time < THRESHOLD) {
printf("cache hit at index %d --> access time --> %lu\n", index, access_time);
}
}
–> we modify previous main function for Flush+Reload
#define CACHE_LINE_SIZE 64
#define ARRAY_SIZE (1 << 22)
#define ITERATIONS 100
#define THRESHOLD 200 // cache access threshold
int main() {
uint8_t *array = (uint8_t *)malloc(ARRAY_SIZE * CACHE_LINE_SIZE);
uint64_t start, end;
unsigned int garbage; // unused variable
// first measurement --> without Flush+Reload
start = __rdtscp(&garbage);
for (int i = 0; i < ITERATIONS; i++) {
volatile uint64_t sum = 0;
for (int j = 0; j < ARRAY_SIZE; j++) {
sum += array[j * CACHE_LINE_SIZE];
}
}
end = __rdtscp(&garbage);
printf("without Flush+Reload --> %lu cycles/iteration\n", (end - start) / ITERATIONS);
// second measurement --> with Flush+Reload
srand(time(NULL));
start = __rdtscp(&garbage);
for (int i = 0; i < ITERATIONS; i++) {
int index = rand() % ARRAY_SIZE;
flush_reload(array, index);
}
end = __rdtscp(&garbage);
printf("with Flush+Reload --> %lu cycles/iteration\n", (end - start) / ITERATIONS);
free(array);
return 0;
}
without Flush+Reload --> 28099291 cycles/iteration
cache hit at index 792050 --> access time --> 182
cache hit at index 313170 --> access time --> 199
cache hit at index 1445633 --> access time --> 186
with Flush+Reload --> 865 cycles/iteration
without Flush+Reload --> 28199962 cycles/iteration
cache hit at index 1636770 --> access time --> 198
with Flush+Reload --> 525 cycles/iteration
cool with such significant results? what does such mean?
–> when processor access memory –> he store a copy of accessed data in a small, fast cache memory –> which is significantly faster than main memory –> when two or more processes accesss same memory location –> they interfere with each other caching behavior –> this can throw cache misses –> so processor will access main memory again to retrieve data –> increasing number of cycles
–> with Flush+Reload –> we take advantage of such behavior by flushing cache line corresponding to a memory location + Reload it 00> if memory location was access relcently –> data will be inside cache –> reload is very fast than if data was retrieve from main memory –> this reduce the noise caused by cache misses –> more accurate measurement
Exploit ? How?
I demonstrated both such techniques as means of noise reduction for accuracy in measurements but they are cache attacks))) cool you learned two such attacks –> we will soon write a simple exploit in future articles inside series but now you have learn working with CPU –> like writing simple programs and optimize them (here we write a simple function for access array –> optimize him)
Flush+Reload > Prime+Probe?
–> we do not access memory address repeated times –> he is faster than Prime+Probe
–> more accuracy –> measuring time taking to reload cache line is nice indicator of memory access
–> when victim memory is frequently modified + small cache line size –> Flush+Reload is better here because he detect more fine access patterns in small time
etc
Prime+Probe > Flush+Reload?
–> multi core system –> different cores may have different copies of same cache line –> with Prime+Probe we can prime cache line on one core –> then probe it inside other core –> reveal information about victim memory access in other core
–> when victim memory is read only –> read only memory regions cannot be modified with Flush+Reload
etc
Prime+Probe
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>
#define CACHE_LINE_SIZE 64
#define ARRAY_SIZE (1 << 22)
#define ITERATIONS 10000
#define PRIME_PROBE_ROUNDS 256 // number of cache lines targeted
void prime_probe(uint8_t *ptr, int start, int end) {
volatile uint64_t sum = 0; // store sum of accessed cache lines
// flush --> specific cache lines
for (int i = start; i < end; i++) {
_mm_clflush(ptr + i * CACHE_LINE_SIZE);
}
// access flushed cache lines --> measure access time
for (int i = start; i < end; i++) {
sum += *(volatile uint64_t *)(ptr + i * CACHE_LINE_SIZE);
// ***access cache lines using a volatile pointer --> ensure actual memory accesses
}
}
int main() {
uint8_t *array = (uint8_t *)malloc(ARRAY_SIZE * CACHE_LINE_SIZE);
uint64_t start, end;
unsigned int garbage; // he will be unused --> because the compiler will optimize away the rdtscp instruction
// first measurement --> without prime probe
start = __rdtscp(&garbage);
for (int i = 0; i < ITERATIONS; i++) {
volatile uint64_t sum = 0;
for (int j = 0; j < ARRAY_SIZE; j++) {
sum += array[j * CACHE_LINE_SIZE];
}
}
end = __rdtscp(&garbage);
printf("without Prime+Probe --> %lu cycles/iteration\n", (end - start) / ITERATIONS);
// second measurement --> with prime probe
srand(time(NULL));
start = __rdtscp(&garbage);
for (int i = 0; i < ITERATIONS; i++) {
// random select a cache set --> prime and probe
int set_index = rand() % (ARRAY_SIZE / PRIME_PROBE_ROUNDS);
int prime_start = set_index * PRIME_PROBE_ROUNDS;
int prime_end = prime_start + PRIME_PROBE_ROUNDS;
prime_probe(array, prime_start, prime_end);
// access all cache lines --> inside selected cache set
volatile uint64_t sum = 0;
for (int j = prime_start; j < prime_end; j++) {
sum += array[j * CACHE_LINE_SIZE];
}
}
end = __rdtscp(&garbage);
printf("with Prime+Probe --> %lu cycles/iteration\n", (end - start) / ITERATIONS);
free(array);
return 0;
}
Flush+Reload
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>
#define CACHE_LINE_SIZE 64
#define ARRAY_SIZE (1 << 22)
#define ITERATIONS 100
#define THRESHOLD 200 // cache access threshold
void flush_cache_line(volatile uint8_t* ptr) {
_mm_clflush((const void*)ptr);
}
uint64_t reload_cache_line(volatile uint8_t* ptr) {
uint64_t start, end;
unsigned int aux;
start = __rdtscp(&aux);
volatile uint8_t value = *ptr;
end = __rdtscp(&aux);
return (end - start);
}
void flush_reload(uint8_t* array, int index) {
volatile uint8_t* ptr = &array[index * CACHE_LINE_SIZE];
flush_cache_line(ptr);
uint64_t access_time = reload_cache_line(ptr);
if (access_time < THRESHOLD) {
printf("cache hit at index %d --> access time --> %lu\n", index, access_time);
}
}
int main() {
uint8_t *array = (uint8_t *)malloc(ARRAY_SIZE * CACHE_LINE_SIZE);
uint64_t start, end;
unsigned int garbage; // unused variable
// first measurement --> without Flush+Reload
start = __rdtscp(&garbage);
for (int i = 0; i < ITERATIONS; i++) {
volatile uint64_t sum = 0;
for (int j = 0; j < ARRAY_SIZE; j++) {
sum += array[j * CACHE_LINE_SIZE];
}
}
end = __rdtscp(&garbage);
printf("without Flush+Reload --> %lu cycles/iteration\n", (end - start) / ITERATIONS);
// second measurement --> with Flush+Reload
srand(time(NULL));
start = __rdtscp(&garbage);
for (int i = 0; i < ITERATIONS; i++) {
int index = rand() % ARRAY_SIZE;
flush_reload(array, index);
}
end = __rdtscp(&garbage);
printf("with Flush+Reload --> %lu cycles/iteration\n", (end - start) / ITERATIONS);
free(array);
return 0;
}
THE NOTE This article is for informational purposes only. We do not encourage you to commit any hacking. Everything you do is your responsibility.
TOX : 340EF1DCEEC5B395B9B45963F945C00238ADDEAC87C117F64F46206911474C61981D96420B72
Telegram : @DevSecAS
More from Uncategorized
Fortinet FortiOS / FortiProxy Unauthorized RCE
CVE-2024-21762 is a buffer overflow write vulnerability in Fortinet Fortigate and FortiProxy. This vulnerability allows an unauthorized attacker to execute …
Active Directory Dumper 2
We check the architecture for strength – an attempt to cram in the unintelligible – we fasten the network resource …
Active Directory Dumper
The purpose of this article is to show the use of the principles of building an application architecture. 1.1.1 What we …