Spatial Sampling Bias in Decision Tree Machine Learning Method for Unconventional Resources
Machine learning methods such as decision trees and random forests are powerful tools for modeling complicated multivariate relationships and may be applied to productivity prediction and uncertainty characterization to support decision-making in unconventional reservoir development. These methods have been developed across a wide variety of applications in which dense or exhaustive sampling is available, such as satellite imagery and process automation. Subsurface modeling and forecasting, in contrast, generally rely on sparse, nonrepresentative sampling; it is therefore necessary to account for spatial sampling bias when constructing a prediction model with machine learning methods. In this study, polygonal declustering is integrated into a machine learning prediction workflow to mitigate spatial sampling bias in a decision tree. Polygonal declustering assigns data weights based on local data density, and these weights may be used to calculate representative statistics for predictive models. In a decision tree, each segmentation is determined by greedy reduction of the residual sum of squares (RSS) of the model relative to the data. Declustering the biased data set before partitioning reduces the influence of spatially biased data on model construction. The declustering weights are applied both when estimating the prediction for each terminal node and, during tree growth, when determining the next hierarchical binary segmentation that minimizes the prediction error. Declustering could also be applied to other tree-based methods such as bagging and random forests. A tree-based method is used for the demonstration because its results are easy to interpret. The spatially weighted decision tree model is demonstrated with two predictor features and one response feature based on a synthetic, but realistic, 2D geological truth model.
By evaluating the error reduction across different degrees of sampling bias, the improvement of the spatially weighted decision tree over a naïve tree model is quantified and a trend is observed. It is shown that spatial sampling bias has a significant effect on the accuracy of the prediction model and that declustering is effective for correcting nonrepresentative data sets. It is recommended that data representativity be addressed in spatial machine learning prediction.
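The weighting idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`polygonal_weights`, `weighted_rss`, `node_prediction`, `best_split`) and the `n_grid` parameter are hypothetical, and the polygon areas are approximated by assigning the cells of a regular grid to their nearest datum, which is an assumption made here for brevity rather than an exact polygon construction.

```python
import numpy as np


def polygonal_weights(xy, extent, n_grid=200):
    """Approximate polygonal declustering weights (assumed grid approach).

    Every grid cell is assigned to its nearest datum, so each datum's cell
    count approximates its polygon area. Weights are scaled to sum to n.
    """
    xmin, xmax, ymin, ymax = extent
    gx = np.linspace(xmin, xmax, n_grid)
    gy = np.linspace(ymin, ymax, n_grid)
    cells = np.array(np.meshgrid(gx, gy)).reshape(2, -1).T  # (n_grid**2, 2)
    # Squared distance from every grid cell to every datum.
    d2 = ((cells[:, None, :] - xy[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(xy)).astype(float)
    return counts * len(xy) / counts.sum()


def weighted_rss(y, w):
    """Weighted residual sum of squares about the weighted mean."""
    if w.sum() <= 0.0:  # empty or zero-weight node contributes nothing
        return 0.0
    mean = np.average(y, weights=w)
    return float(np.sum(w * (y - mean) ** 2))


def node_prediction(y, w):
    """Terminal-node estimate: the declustered (weighted) mean response."""
    return float(np.average(y, weights=w))


def best_split(X, y, w):
    """Greedy search for the single split that minimizes weighted RSS.

    Returns (feature index, threshold, weighted RSS after the split), or
    (None, None, parent RSS) if no split improves on the parent node.
    """
    best = (None, None, weighted_rss(y, w))
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive sorted values.
        for t in (values[:-1] + values[1:]) / 2.0:
            left = X[:, j] <= t
            rss = weighted_rss(y[left], w[left]) + weighted_rss(y[~left], w[~left])
            if rss < best[2]:
                best = (j, t, rss)
    return best
```

Applying `best_split` recursively to each child node, and using `node_prediction` at the terminal nodes, would give a spatially weighted regression tree in the sense described in the abstract. As an alternative to a hand-written search, scikit-learn's `DecisionTreeRegressor.fit` accepts a `sample_weight` argument through which declustering weights could be passed; whether this matches the workflow used in the study is not stated in the source.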
AAPG Datapages/Search and Discovery Article #90350 © 2019 AAPG Annual Convention and Exhibition, San Antonio, Texas, May 19-22, 2019