Attention doesn't work well for downsample_step=1 and outputs_per_step=1

Noticed while working on https://github.com/r9y9/deepvoice3_pytorch/pull/21.

Trained for 300k steps, but the model did not generalize well. Need to figure out how to improve this.