Added a CPU-specific memcpy function ('memcpy_cpu'), which the default 'memcpy' now tries first. Improved the default 'memcpy' to copy in eight-byte chunks.
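
A minimal sketch of how such a dispatch might look, not the actual implementation. Assumptions that are not stated in this change: 'memcpy_cpu' returns NULL when no CPU-specific path applies, the eight-byte path is used only when both pointers are 8-byte aligned, and the names 'memcpy_sketch' and 'chunk8_t' are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Eight-byte chunk type. may_alias (a GCC/Clang extension) lets it
 * legally read and write arbitrary bytes; other compilers get a plain
 * uint64_t and may need strict aliasing disabled. */
#if defined(__GNUC__)
typedef uint64_t __attribute__((may_alias)) chunk8_t;
#else
typedef uint64_t chunk8_t;
#endif

/* Hypothetical stand-in for the CPU-specific routine.
 * Assumed contract: return dst after copying, or NULL when no
 * optimized path is available. This stub always declines. */
static void *memcpy_cpu(void *dst, const void *src, size_t n)
{
    (void)dst;
    (void)src;
    (void)n;
    return NULL;
}

/* Default copy: try the CPU-specific routine first, then fall back to
 * eight-byte chunks followed by a byte-wise tail. */
void *memcpy_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (memcpy_cpu(dst, src, n) != NULL)
        return dst;

    /* Use eight-byte chunks only when both pointers are 8-byte aligned
     * (an assumption of this sketch, to avoid unaligned accesses). */
    if ((((uintptr_t)d | (uintptr_t)s) & 7u) == 0) {
        while (n >= 8) {
            *(chunk8_t *)d = *(const chunk8_t *)s;
            d += 8;
            s += 8;
            n -= 8;
        }
    }

    /* Copy the remaining bytes, or everything if the buffers were unaligned. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }

    return dst;
}
```

Trying 'memcpy_cpu' first keeps the generic loop as a portable fallback, so a platform without an optimized routine still gets the chunked copy.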