Use multiple load store instructions for 32 byte chunks in ARM-specific
blit-function, analog to x86 variant. Make the blit-function of x86 a
generic one, and provide needed utility functions for ARM and generic code.
Please refer issue #147 for discussion.