While running mbw on an ARM platform, specifically a SABRE Lite development board, I noticed that the memcpy supplied by the default libc was much faster than the word-by-word copy. That is not really a surprise, but I wondered how I could reach a similar speed. After some searching, I found an excellent ARM document, "What is the fastest way to copy memory on a Cortex-A8", which concludes that a NEON memory copy with PLD is the fastest. I repeat the code below for your convenience.

The PLD instruction tries to pre-fetch data from memory into the cache. Note the offset in the PLD instruction: it is 0xC0 bytes, while each round copies only 0x40 bytes, so the PLD is preparing data for a later round of the VLDM and VSTM pair rather than for the current one. The hope is that the processor overlaps this pre-fetch with the current copy instructions, so that the required data is already in the cache by the time that round starts. The NEON instructions first load 64 bytes from the source into the eight registers d0 to d7, and then store them to the destination. The exclamation marks after r1 and r0 make the processor advance the source and destination addresses automatically after each load and store.

NEON memcpy
NEONCopyPLD:
    PLD [r1, #0xC0]
    VLDM r1!, {d0-d7}
    VSTM r0!, {d0-d7}
    SUBS r2, r2, #0x40
    BGE NEONCopyPLD
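
For readers without a NEON toolchain at hand, the same prefetch-ahead idea can be sketched in portable C. This is only an illustration under my own assumptions, not the code from the ARM document: `copy_with_prefetch`, `BLOCK_BYTES` and `PREFETCH_AHEAD` are names I made up, and `__builtin_prefetch` is a GCC/Clang builtin that is just a hint to the compiler and processor (on ARM it typically becomes a PLD). The distance of 0xC0 bytes mirrors the offset in the listing above.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_BYTES    64u    /* bytes copied per round, like d0-d7 */
#define PREFETCH_AHEAD 0xC0u  /* prefetch distance, same offset as the PLD above */

/* Hypothetical helper: copies n bytes in 64-byte rounds and hints the cache
 * about source data that a later round will need. */
static void *copy_with_prefetch(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    while (n >= BLOCK_BYTES) {
        /* Only prefetch while the target is still inside the source buffer,
         * so the pointer arithmetic stays valid C. */
        if (n > PREFETCH_AHEAD)
            __builtin_prefetch(s + PREFETCH_AHEAD, 0 /* read */, 0 /* streaming */);

        memcpy(d, s, BLOCK_BYTES);   /* stands in for the VLDM/VSTM pair */
        d += BLOCK_BYTES;
        s += BLOCK_BYTES;
        n -= BLOCK_BYTES;
    }
    memcpy(d, s, n);                 /* leftover bytes, fewer than one round */
    return dest;
}
```

Changing `PREFETCH_AHEAD` is an easy way to experiment with how far ahead the pre-fetch should run on a given board.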

As expected, NEONCopyPLD does achieve higher memory bandwidth, but it still falls short of the libc memcpy. Of course I was curious about the reason, so I compiled mbw with static linking and disassembled the binary to find out why. The standard library memcpy uses ldm and stm instructions, which also operate on multiple registers, in this case general-purpose ones. The code below demonstrates the basic idea and skips a lot of the checks on alignment and length.

The main difference is that the code issues multiple pld instructions before it actually copies the data. My understanding (I could be wrong) is that the multiple pld instructions use different execution units of the processor to pre-fetch data from memory while the ldm and stm instructions are copying data. One single pld may finish too early, so that the pre-fetch unit sits idle instead of preparing the subsequent data. Please let me know if you know exactly what is going on under the hood. Note that a pld instruction should not trigger a synchronous data abort even if the address to be pre-fetched cannot be translated to a physical address.

LDM/STM memcpy
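#include <stddef.h>

/* Bare-bones copy: assumes word-aligned pointers and that n is a
 * non-zero multiple of 64. No alignment or tail handling. */
void *memcpy(void *dest, const void *src, size_t n)
{
    void *d = dest;         /* working copies: the asm advances these */
    const void *s = src;

    __asm__ __volatile__(
            "1:                     \n"
            "pld    [%1]            \n"   /* pre-fetch ahead of the copy */
            "pld    [%1, #28]       \n"
            "pld    [%1, #60]       \n"
            "pld    [%1, #92]       \n"
            "pld    [%1, #124]      \n"
            "ldmia  %1!, {r3-r10}   \n"   /* 32 bytes: load, then store */
            "stmia  %0!, {r3-r10}   \n"
            "ldmia  %1!, {r3-r10}   \n"   /* second 32 bytes */
            "stmia  %0!, {r3-r10}   \n"
            "subs   %2, %2, #0x40   \n"   /* 64 bytes done */
            "bgt    1b              \n"   /* loop only while bytes remain */
            : "+r"(d), "+r"(s), "+r"(n)   /* modified by the asm */
            :
            : "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "cc", "memory"
            );
    return dest;
}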
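
The listing above leaves out the checks on alignment and length. As a rough sketch of what such checks could look like, the wrapper below only takes the block path when both pointers are word-aligned (ldm and stm need word-aligned addresses) and copies whatever is left one byte at a time. Everything here is my own assumption rather than what libc does: `copy_checked` is an invented name, and `copy_blocks64` is a hypothetical stand-in for the assembly loop, written in plain C so the example builds on any machine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the assembly loop: copies exactly `blocks`
 * 64-byte blocks. On ARM the pld/ldmia/stmia loop would go here. */
static void copy_blocks64(unsigned char *d, const unsigned char *s, size_t blocks)
{
    while (blocks-- > 0) {
        memcpy(d, s, 64);
        d += 64;
        s += 64;
    }
}

/* Dispatches between the fast block path and a plain byte loop. */
static void *copy_checked(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t bulk = 0;   /* number of bytes handled by the block path */

    /* Fast path only when both addresses are 4-byte aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & 3u) == 0) {
        bulk = n & ~(size_t)63;          /* largest multiple of 64 <= n */
        copy_blocks64(d, s, bulk / 64);
    }

    /* Tail (or the whole buffer if unaligned), one byte at a time. */
    for (size_t i = bulk; i < n; i++)
        d[i] = s[i];

    return dest;
}
```

A real library routine is more elaborate, for example it first copies a few bytes to reach alignment instead of giving up on the fast path, which is part of why it pays a small cost compared with the bare loop.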

The end result is that I achieved a speed similar to that of the standard C library memcpy, actually a little faster, since the code skips so many checks. You can download the mbw benchmark and try it out yourself. This experiment shows the importance of memory pre-fetching, and that is where the title comes from.