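The approach described here (find the lowest set bit with a count-trailing-zeros instruction, record its index, clear it, repeat) is probably faster than the current implementation: https://lemire.me/blog/2018/02/21/iterating-over-set-bits-quickly/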
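
For reference, a rough sketch of what that technique could look like in C. The function name `set_bit_positions`, the word-array layout, and the output buffer are my own assumptions for illustration, not anything from the current code, and `__builtin_ctzll` is GCC/Clang-specific. I have not benchmarked this against the existing implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the index of every set bit found in `words`
 * consecutive 64-bit words of `bitmap` into `out`, in ascending order, and
 * returns how many indices were written. `out` must have room for up to
 * 64 * words entries. */
size_t set_bit_positions(const uint64_t *bitmap, size_t words, uint32_t *out)
{
    size_t count = 0;
    for (size_t k = 0; k < words; k++) {
        uint64_t w = bitmap[k];
        /* The loop runs once per set bit, not once per bit position. */
        while (w != 0) {
            /* Index of the lowest set bit (GCC/Clang builtin; w is nonzero). */
            int r = __builtin_ctzll(w);
            out[count++] = (uint32_t)(k * 64 + (size_t)r);
            /* Clear the lowest set bit. */
            w &= w - 1;
        }
    }
    return count;
}
```

The gain, if any, should be largest on sparse words, since the inner loop does work proportional to the number of set bits rather than testing all 64 positions.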