Use shuffle() implementation from zlib_into
This should make compressing data faster, since the byte shuffle step no longer goes through NumPy. We're making a similar change in pycalibration: calibration/pycalibration!1163 (merged)
At present the function only works for item sizes of 2, 4 or 8 bytes, which are the sizes we work with. It could be made generic in the future, although the performance wouldn't be as good for arbitrary sizes.
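
For reviewers unfamiliar with byte shuffling, here is a minimal pure-Python sketch of the transform applied before compression. It is illustrative only: it is neither the zlib_into implementation nor the NumPy code being replaced, and the names `byte_shuffle` and `byte_unshuffle` are hypothetical. It assumes the usual shuffle layout, in which the first bytes of all items are written first, then all the second bytes, and so on.

```python
import struct
import zlib


def byte_shuffle(raw: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every item, then byte 1 of every item, and so on."""
    if len(raw) % itemsize:
        raise ValueError("buffer length must be a multiple of itemsize")
    # raw[i::itemsize] picks byte i of each item in order
    return b"".join(raw[i::itemsize] for i in range(itemsize))


def byte_unshuffle(shuffled: bytes, itemsize: int) -> bytes:
    """Inverse of byte_shuffle: interleave the byte planes back into items."""
    if len(shuffled) % itemsize:
        raise ValueError("buffer length must be a multiple of itemsize")
    n_items = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for i in range(itemsize):
        # Plane i holds byte i of every item; scatter it back with a stride
        out[i::itemsize] = shuffled[i * n_items:(i + 1) * n_items]
    return bytes(out)


# Four little-endian 16-bit values (item size 2)
values = struct.pack("<4H", 0, 1, 2, 3)
shuffled = byte_shuffle(values, 2)

# The shuffled bytes are what gets passed to the compressor
compressed = zlib.compress(shuffled)
```

Grouping the bytes this way tends to put similar values (such as the mostly-zero high bytes of small integers) next to each other, which usually helps the compressor. A fixed set of item sizes lets an implementation specialise its inner loop per size, which is consistent with the note above about arbitrary sizes being slower.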