Obtaining the permeability of porous media directly from digital images improves our understanding of subsurface flow and facilitates research into it. Because the pore space is complex, the numerical simulation methods used to calculate permeability are time-consuming. Deep learning models, typified by three-dimensional convolutional neural networks (3D CNNs), are a promising way to improve efficiency and have made significant advances in predicting the permeability of porous media. However, 3D CNNs can only represent the local information of 3D images and cannot account for the spatial correlation between 2D slices, a significant factor in the reconstruction of porous media. This study combines a 2D CNN with a self-attention mechanism to propose a novel CNN-Transformer hybrid neural network that makes full use of the 2D slice sequences of porous media to accurately predict their permeability. In addition, we add physical information to the slice sequences and build a PhyCNN-Transformer model to reflect the impact of physical properties on permeability prediction. For dataset preparation, we use the publicly available DeePore porous media dataset, in which the permeability labels are calculated by pore network modelling (PNM). We compare the two transformer-based models with a 3D CNN in terms of parameter count, training efficiency, prediction performance, and generalization, and the transformer-based models show significant improvement. Combined with transfer learning, the transformer-based models also demonstrate superior generalization to unfamiliar samples when only a small number of samples is available.
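To make the slice-sequence idea concrete, the following is a minimal, untrained sketch of the data flow only; it is not the model proposed here. It treats a 3D binary image as a sequence of 2D slices, turns each slice into a feature vector, optionally appends one physical quantity per slice, and lets scaled dot-product self-attention mix information across slices before a linear read-out. Everything named in it is an assumption made for illustration: `pool_features` (plain mean pooling) stands in for a learned 2D CNN, per-slice porosity is a guess at what "physical information" might look like, pore voxels are assumed to be encoded as 1, the query/key/value projections are identities, positional encoding is omitted, and the weights are random.

```python
import math
import random


def pool_features(slice_2d, k=2):
    """Hypothetical stand-in for a learned 2D CNN: k x k mean pooling, flattened."""
    feats = []
    for r in range(0, len(slice_2d) - k + 1, k):
        for c in range(0, len(slice_2d[0]) - k + 1, k):
            block = [slice_2d[r + i][c + j] for i in range(k) for j in range(k)]
            feats.append(sum(block) / len(block))
    return feats


def porosity(slice_2d):
    """Fraction of pore voxels in a slice (assumes pore = 1, solid = 0)."""
    voxels = [v for row in slice_2d for v in row]
    return sum(voxels) / len(voxels)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(tokens[0])
    mixed = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in tokens]
        weights = softmax(scores)
        mixed.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return mixed


def predict(volume, head_weights, use_physics=True):
    """Slices -> per-slice tokens -> attention across slices -> mean pool -> linear head."""
    tokens = []
    for slice_2d in volume:
        token = pool_features(slice_2d)
        if use_physics:
            token = token + [porosity(slice_2d)]  # assumed physical channel
        tokens.append(token)
    mixed = self_attention(tokens)
    d = len(mixed[0])
    pooled = [sum(t[i] for t in mixed) / len(mixed) for i in range(d)]
    return sum(w * x for w, x in zip(head_weights, pooled))


random.seed(0)
# Toy volume: 6 slices of 4 x 4 binary voxels (random, not real rock data).
volume = [[[random.randint(0, 1) for _ in range(4)] for _ in range(4)] for _ in range(6)]
# 4 pooled features + 1 porosity value per slice; random, untrained weights.
head_weights = [random.uniform(-1.0, 1.0) for _ in range(5)]
score = predict(volume, head_weights)
```

Because nothing here is trained, `score` carries no physical meaning. The point of the sketch is that each slice's output token is a weighted combination of all slices in the sequence, which is the inter-slice correlation a purely local convolution does not capture directly.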