A benchmark to test generalization capabilities of deep learning methods to classify severe convective storms in a changing climate