Many statistical downscaling methods require observational inputs and expert knowledge and thus do not generalize well across different regions. Convolutional neural networks (CNNs) are deep-learning models that have shown generalization abilities in a variety of applications. In this research, we modify UNet, a semantic-segmentation CNN, and apply it to the downscaling of daily maximum/minimum 2-m temperature (TMAX/TMIN) over the western continental United States from 0.25° to 4-km grid spacings. We select high-resolution (HR) elevation, low-resolution (LR) elevation, and LR TMAX/TMIN as inputs; train UNet using Parameter-Elevation Regressions on Independent Slopes Model (PRISM) data over the south- and central-western United States from 2015 to 2018; and test it independently over both the training domains and the northwestern United States from 2018 to 2019. We found that the original UNet cannot generate enough fine-grained spatial detail when transferred to the new northwestern U.S. domain. In response, we modified the original UNet by adding an extra HR elevation output branch with its own loss function and training the modified UNet to reproduce both the supervised HR TMAX/TMIN and the unsupervised HR elevation. This modified model is named "UNet-Autoencoder (AE)." UNet-AE supports semisupervised model fine-tuning for unseen domains and showed better gridpoint-level performance, with more than 10% mean absolute error (MAE) reduction relative to the original UNet. On the basis of its performance relative to the 4-km PRISM, UNet-AE is a good option for providing generalizable downscaling in regions that are underrepresented by observations.
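The two-branch training objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of MAE as the training loss for both branches, and the `elev_weight` parameter are all assumptions made for illustration. The sketch shows how an elevation-reconstruction term can remain available when no HR temperature target exists, which is the property that would allow semisupervised fine-tuning on an unseen domain.

```python
import numpy as np


def mae(pred, target):
    """Mean absolute error over all grid points."""
    return float(np.mean(np.abs(pred - target)))


def unet_ae_loss(pred_temp, pred_elev, hr_elev, hr_temp=None, elev_weight=1.0):
    """Hypothetical combined loss for a two-branch (temperature + elevation) model.

    pred_temp, pred_elev : model outputs on the HR grid
    hr_elev              : HR elevation, which is also a model input, so this
                           term needs no extra labels (the "unsupervised" branch)
    hr_temp              : HR TMAX/TMIN target, or None when the domain has no
                           HR temperature data (assumed fine-tuning case)
    elev_weight          : assumed weighting between the two terms
    """
    loss = elev_weight * mae(pred_elev, hr_elev)
    if hr_temp is not None:
        # Supervised term, only available where HR temperature exists.
        loss += mae(pred_temp, hr_temp)
    return loss


def mae_reduction_percent(mae_baseline, mae_new):
    """Relative MAE reduction of a new model versus a baseline, in percent."""
    return 100.0 * (mae_baseline - mae_new) / mae_baseline
```

With this layout, a training domain would call `unet_ae_loss` with `hr_temp` supplied, while an unseen domain would omit it and update the model using the elevation term alone. The helper `mae_reduction_percent` shows how a figure such as "more than 10% MAE reduction" would be computed from two MAE values.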