Use the __builtin_popcount intrinsic to optimize the bit counting
function if the compiler supports it. This change replaces the manual
bit counting algorithm with the more efficient built-in function, which
leverages hardware support on compatible processors.
This modification ensures that the code remains backward-compatible by
falling back to the original implementation when __builtin_popcount is
not available.