The rate of background seismicity, that is, of earthquakes not directly triggered by another earthquake, reflects the stressing rate of fault systems in active seismic regions. However, aftershock sequences often dominate the overall seismicity rate, masking this background signal. Identifying aftershocks in earthquake catalogs, a task known as declustering, is thus an important problem in seismology. Most solutions rely on spatio-temporal distances between successive events, such as the widely used nearest-neighbor-distance algorithm. This algorithm assumes that the space-time metric follows a bimodal distribution, with one mode corresponding to background seismicity and the other to aftershocks. Constraining these two distributions is key to accurately separating aftershocks from background events. Recent work often applies a linear split based on a nearest-neighbor-distance threshold, ignoring the overlap between the two populations and thereby misidentifying both background earthquakes and aftershock sequences. We revisit this problem with machine-learning classification and clustering algorithms. After testing several popular algorithms, we show that a random forest trained on synthetic catalogs generated by an Epidemic Type Aftershock Sequence (ETAS) model outperforms approaches such as K-means, Gaussian mixture models, and support vector classifiers. We evaluate different data features and discuss their importance in classifying aftershocks. We then apply our model to two real earthquake catalogs, the relocated Southern California Earthquake Center catalog and the GeoNet catalog of New Zealand. Our model adapts well to these two different tectonic contexts, highlighting differences in aftershock productivity between crustal and intermediate-depth seismicity.
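For context, the nearest-neighbor space-time metric mentioned above is commonly computed in the style of Zaliapin and Ben-Zion as η = Δt · Δr^{d_f} · 10^{−b·m}, where Δt and Δr are the interevent time and epicentral distance to a candidate parent, m is the parent magnitude, b is the Gutenberg-Richter b-value, and d_f a fractal dimension of epicenters. The sketch below is illustrative only and not taken from this paper: it assumes a time-sorted catalog with planar coordinates in km, and the function name and default parameter values are assumptions.

```python
import numpy as np

def nearest_neighbor_distances(t, x, y, m, b=1.0, df=1.6):
    """Nearest-neighbor space-time distance eta for each event (illustrative
    sketch): eta_ij = dt * dr**df * 10**(-b * m_i), minimized over all
    earlier events i, which are the candidate parents of event j.

    t : event times (e.g. days), sorted ascending
    x, y : planar epicentral coordinates (km)
    m : magnitudes
    """
    n = len(t)
    eta = np.full(n, np.inf)     # first event has no parent, eta stays inf
    parent = np.full(n, -1)
    for j in range(1, n):
        dt = t[j] - t[:j]                           # interevent times
        dr = np.hypot(x[j] - x[:j], y[j] - y[:j])   # epicentral distances
        dr = np.maximum(dr, 1e-3)                   # guard against zero distance
        d = dt * dr**df * 10.0 ** (-b * m[:j])      # rescaled space-time distance
        k = int(np.argmin(d))
        eta[j], parent[j] = d[k], k
    return eta, parent
```

In a declustering workflow, the resulting η values (or their separate time and space components) would form the bimodal distribution the abstract refers to, with small η indicating likely aftershocks of the identified parent.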