Question 17
Domain 2: Explore data and run experiments
You are working with a time series dataset in Azure Machine Learning using the Python SDK v2. You need to split the dataset into training and testing subsets while preserving the temporal order. Which of the following approaches should you use?
Correct answer: A
Explanation
"TimeSeriesSplit" is designed for time-ordered data because it creates sequential folds without shuffling, so earlier observations stay in training and later ones in testing. This preserves temporal order, which is essential for time series modeling and avoids leakage from future data into the past.
Why each option is right or wrong
A. Use sklearn.model_selection.TimeSeriesSplit to generate training and testing splits.
sklearn.model_selection.TimeSeriesSplit is the cross-validation splitter intended for ordered observations: it produces successive folds where each test fold comes after the corresponding training fold, rather than randomly partitioning rows. In scikit-learn, the default number of splits is 5, and the splitter preserves chronology by never shuffling, which is the required behavior when the Azure ML dataset represents a time series and future values must not leak into the training set.
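The fold behavior described above can be checked with a small sketch. The 12-row array is a hypothetical stand-in for a dataset already sorted by time, and n_splits=3 is chosen only to keep the output short (the default is 5).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 12 observations already sorted by time (0 = oldest).
X = np.arange(12).reshape(-1, 1)

# n_splits=3 is an illustrative choice; the scikit-learn default is 5.
tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(splits):
    # Each training window ends before its test window begins.
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Every training index is smaller than every test index in the same fold, and the training window grows from one fold to the next, so no future observation is used to fit a model that is evaluated on earlier data.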
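B. Use random.shuffle on the dataset before splitting.
random.shuffle reorders the rows at random, which destroys the chronological order before the split is even made. Any split taken afterwards mixes later observations into the training data, so the model can learn from the future.
C. Use train_test_split with shuffle=True.
train_test_split with shuffle=True (which is also its default) assigns rows to the training and testing sets at random. The resulting subsets interleave past and future observations, which is exactly the leakage a time series split must avoid.
D. Use pandas.DataFrame.sample() to split the dataset.
pandas.DataFrame.sample() draws a random sample of rows, so the selected subset does not respect temporal order. It is useful for random subsampling but not for a chronological train/test split.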
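To see why the shuffling-based options break temporal order, the following minimal sketch compares a shuffled split with an unshuffled one. The 20-element array is a hypothetical time-ordered index, and the shuffle=False call is included only as a contrast; it is not one of the answer options.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical time-ordered index: 0 is the oldest row, 19 the newest.
X = np.arange(20)

# Shuffled split (as in option C): rows are assigned at random.
train_shuf, test_shuf = train_test_split(
    X, test_size=0.25, shuffle=True, random_state=0
)

# Unshuffled split, shown for contrast: the last 25% of rows become the test set.
train_ord, test_ord = train_test_split(X, test_size=0.25, shuffle=False)

# True when some training row is newer than some test row (future leakage).
leak_shuffled = bool(train_shuf.max() > test_shuf.min())
leak_ordered = bool(train_ord.max() > test_ord.min())

print("shuffled train:", sorted(train_shuf.tolist()))
print("shuffled split leaks future rows:", leak_shuffled)
print("ordered split leaks future rows:", leak_ordered)
```

In the unshuffled split the training rows all precede the test rows, whereas a shuffled split will almost always place rows newer than the test rows into training.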