OR/21/006 Facies classification using supervised machine learning
|Newell, A J, Woods, M A, Graham, R L, and Christodoulou, V. 2021. Derivation of lithofacies from geophysical logs: a review of methods from manual picking to machine learning. British Geological Survey Open Report, OR/21/006.
Contributor/editor: Kingdon, A
6.1 BACKGROUND
Machine learning is essentially a set of data-analysis methods that includes classification, clustering, and regression. Machine learning algorithms can discover similarities and trends within large, complex datasets without being explicitly programmed, in essence learning from the data itself. These methods are well suited to the task of deriving lithology categories from geophysical log data. In borehole datasets there are often boreholes where the lithology or lithofacies is known from expert descriptions of core, and other boreholes where only geophysical logs are available. Boreholes (or parts of boreholes) where core and geophysical logs coexist can be used as training data for the machine learning algorithm, and the relationships that are established can then be applied to unknown borehole sections.
Growth in the use of machine learning methods for facies classification of log data has occurred in parallel with the availability of many open-source packages, much of whose functionality used to be available only in proprietary software platforms. The best-known general example is scikit-learn (http://scikit-learn.org/), a collection of machine learning tools coded in Python. These tools form the core of many Jupyter notebooks that have been compiled specifically for the purpose of classifying geophysical log data (e.g. https://github.com/brendonhall/facies_classification). This notebook (which employs a support vector machine, or SVM) has been trialled for the purpose of this report and forms the basis for the discussion below. Many other notebooks using alternative supervised-learning algorithms, such as random forest classification, are available but have not yet been trialled (e.g. https://github.com/seg/2016-ml-contest).
6.2 INPUT DATA
The preparation of input data is straightforward: a simple table comprising a series of log measurements with a known lithology or lithofacies class (Table 5). Here the training values represent short intervals of the borehole where the lithology can be determined confidently from the log response and cuttings. Ideally, of course, the training dataset should be based on core (or high-quality image logs) and might be assembled from multiple boreholes that prove the same formation within a basin.
Table 5. Example format of training data. In this case a subset of the geophysical measurements (INPUT_FACIES) is used to classify the entire stratigraphic interval (SVM_CLASS). The training data would often be based on expert descriptions of core.
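A table in the format of Table 5 can be assembled and inspected with pandas, the data-handling library the facies-classification notebook itself relies on. The sketch below is illustrative only: the column names (DEPTH, GR, RHOB, NPHI, INPUT_FACIES) and values are hypothetical, and in practice the table would be read from file with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical training table in the style of Table 5; real data would
# come from a file, e.g. pd.read_csv("training_data.csv")
training = pd.DataFrame({
    "DEPTH":        [100.0, 100.5, 101.0, 150.0, 150.5],
    "GR":           [25.0, 30.0, 28.0, 110.0, 95.0],   # gamma ray (API)
    "RHOB":         [2.65, 2.63, 2.64, 2.45, 2.50],    # bulk density (g/cm3)
    "NPHI":         [0.08, 0.10, 0.09, 0.30, 0.28],    # neutron porosity (v/v)
    "INPUT_FACIES": ["sandstone", "sandstone", "sandstone",
                     "mudstone", "mudstone"],
})

# Descriptive statistics per facies class, as generated in the first
# stage of the notebook
print(training.groupby("INPUT_FACIES")[["GR", "RHOB", "NPHI"]].mean())
```

Each row is one depth sample; only intervals with a confidently assigned facies label enter the training set.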
6.3 RUNNING THE CLASSIFICATION
After loading the input data file, the Jupyter notebook takes the user through the SVM classification process step by step with blocks of runnable Python code, text providing instructions and commentary, and interactive tabular and graphical outputs (Figure 28).
Figure 28. Part of the Jupyter notebook for facies classification (https://github.com/brendonhall/facies_classification) showing the typical mixture of runnable Python code, text providing instructions and commentary and interactive tabular output (or graphical plots).
The process breaks down into six stages.
1. Examining the input data. After the text file of training data is loaded, some descriptive statistics are generated, and there are options to plot and view the log data in conventional tracks or as cross-plots between different types of log measurement coloured according to lithofacies.
2. Splitting the dataset. A standard practice when training supervised-learning algorithms is to separate some data from the training set to evaluate the accuracy of the classifier. For example, one or more wells could be removed from a multi-borehole training dataset to act as test data (Figure 29). These test data play no part in the training or cross-validation of the SVM classifier.
Figure 29. Splitting the training dataset into training data and test data
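The leave-one-well-out split described above can be sketched in a few lines of pandas. This is a minimal illustration, assuming a hypothetical WELL column identifying the borehole each depth sample came from; the well names and log values are invented.

```python
import pandas as pd

# Hypothetical multi-borehole training table
data = pd.DataFrame({
    "WELL":   ["A", "A", "B", "B", "C", "C"],
    "GR":     [25.0, 110.0, 30.0, 95.0, 28.0, 100.0],
    "FACIES": ["sand", "mud", "sand", "mud", "sand", "mud"],
})

# Hold out one entire well as blind test data; it plays no part in
# training or cross-validation of the classifier
test_well = "C"
train = data[data["WELL"] != test_well]
test = data[data["WELL"] == test_well]

print(len(train), len(test))  # 4 training samples, 2 blind test samples
```

Holding out whole wells, rather than random rows, gives a more honest accuracy estimate because adjacent depth samples within a single well are strongly correlated.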
3. Standardising the dataset. Geophysical log measurements record a range of properties in different units of widely varying magnitude (e.g. see the summary statistics table in Figure 28). Many machine-learning algorithms, however, assume that all features are centred around zero and have variance of the same order. If one log type has a variance that is orders of magnitude larger than the others, it will dominate the objective function and impede learning from the other features. For this reason all log data (including both the training set and later input data) must be standardised to a similar scale and deviation. This can be undertaken using the StandardScaler function in scikit-learn.
4. Training the support vector classifier. Training the support vector classifier is an optimisation process. The SVM classifier learns from the training dataset a projection into a higher-dimensional space where classes can be separated by a hyperplane (or set of hyperplanes) that maximises the margin separating the classes. Hyperplanes are higher-dimensional generalisations of a plane; in 2D the hyperplane or decision boundary corresponds to a line (Figure 30). New uncategorised samples are classified according to the side of the hyperplane on which they fall when projected into the same space. "Soft margin" classification can accommodate some classification errors on the training data where the data are not perfectly separable.
Figure 30. 2D example of a linear boundary that maximises the margin between the closest pair of data points belonging to two classes. The support vectors are the points on the dashed lines. Modified from Wilimitis (2018).
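Stages 3 and 4 can be sketched together with scikit-learn's `StandardScaler` and `SVC` classes. The data below are synthetic, standing in for two log features on very different scales (e.g. gamma ray in API units versus neutron porosity as a fraction); the facies labels and distributions are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two synthetic log features on very different scales, 50 samples per class
X = np.column_stack([
    np.concatenate([rng.normal(30, 5, 50), rng.normal(100, 10, 50)]),      # "GR"
    np.concatenate([rng.normal(0.08, 0.02, 50), rng.normal(0.30, 0.03, 50)]),  # "NPHI"
])
y = np.array(["sand"] * 50 + ["mud"] * 50)

# Stage 3: centre each feature on zero with unit variance so that the
# large-magnitude log does not dominate the optimisation
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Stage 4: the RBF kernel maps observations into a higher-dimensional
# space where a maximum-margin hyperplane can separate the classes;
# C controls the softness of the margin
clf = SVC(kernel="rbf", C=10).fit(X_scaled, y)
print(clf.score(X_scaled, y))
```

The fitted `scaler` must be kept alongside the classifier, since any data classified later has to be rescaled with exactly the same parameters.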
A cross-validation dataset is used to tune the parameters of the training model and is created by randomly splitting the training data into subsets (Figure 29). The SVM implementation in scikit-learn takes a number of parameters which control the learning rate and the specifics of the kernel functions that map the original, linearly inseparable observations into a higher-dimensional space in which they become separable. A succession of models is created with different parameter values, and the combination with the lowest cross-validation error is used for the classifier.
5. Evaluating the classifier. To evaluate the accuracy of the classifier, the borehole that was set aside at the beginning of the process can be used to compare the predicted facies with the actual ones, which may have been determined from expert description of core. A range of accuracy metrics are calculated, or the observed and predicted facies can simply be compared visually in adjacent log tracks. As demonstrated by periodic machine learning competitions using hidden control boreholes (e.g. https://github.com/seg/2016-ml-contest), SVM and related supervised ML approaches, while sophisticated, do not generate completely accurate results, with F1 scores ranging from 0.4 to 0.6.
6. Applying the classifier. Once created, the SVM classifier can be applied to other boreholes that contain a comparable range of lithofacies (e.g. they are part of the same formation) and have a similar array of geophysical logs. These logs will need to be rescaled using the same parameters used to rescale the training set. The results can be saved as text files and loaded into other log-handling or geological-modelling software (Figure 31).
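The parameter search and evaluation described above can be sketched with scikit-learn's `GridSearchCV`, which performs the cross-validated sweep over parameter combinations automatically. The data here are synthetic Gaussian clusters standing in for two facies classes; the parameter grid is a small illustrative choice, not the grid used in the notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Synthetic two-class data: 60 samples per class, 3 "log" features
X = np.concatenate([rng.normal(0, 1, (60, 3)), rng.normal(2, 1, (60, 3))])
y = np.array([0] * 60 + [1] * 60)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the scaler on the training data only; the held-out test data are
# rescaled with the same parameters
scaler = StandardScaler().fit(X_train)

# Cross-validated sweep over C (margin softness) and gamma (RBF kernel
# width); the combination with the best mean CV score is retained
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [0.01, 0.1, 1]},
    cv=5,
)
grid.fit(scaler.transform(X_train), y_train)

# Evaluate the tuned classifier on the held-out data
y_pred = grid.predict(scaler.transform(X_test))
print(grid.best_params_, f1_score(y_test, y_pred))
```

On real log data, with overlapping facies classes, the F1 scores are considerably lower than on clean synthetic clusters, consistent with the 0.4 to 0.6 range seen in the SEG contest.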
6.4 EXAMPLE OF RESULTS
An example of the output is shown in Figure 31 for the Winterborne Kingston borehole. The input data were simply short extracts from three log types attributed with a lithofacies (shown in the input facies column). SVM is used here simply to upscale these inputs to the entire log and, with well-defined lithologies, appears to produce meaningful results.
Figure 31. Example of SVM classifier