OR/21/006 Unsupervised cluster analysis

Newell, A J, Woods, M A, Graham, R L, and Christodoulou, V. 2021. Derivation of lithofacies from geophysical logs: a review of methods from manual picking to machine learning. British Geological Survey Open Report, OR/21/006.

Contributor/editor: Kingdon, A

5.1 BACKGROUND

Clustering (or cluster analysis) is a technique that finds groups of similar objects that are more related to each other than to objects in other groups. In the case of geophysical logs, the clusters might relate to different lithofacies types. There are many different clustering algorithms, but one of the more widely used for deriving lithologies from geophysical logs is k-means clustering (Cerqueira et al. 2019). K-means clustering can be applied to two or more logs and does not use any training data to guide the formation of clusters. Training data might include predetermined knowledge of how geophysical response varies with lithofacies, gained from examining borehole core or image logs. K-means clustering thus falls into the category of unsupervised techniques.

In summary, k-means clustering proceeds in four automated steps:

1. From the sample points, a pre-defined number, k, of cluster centroids are randomly picked as initial cluster centres.
2. Each sample is assigned to the nearest centroid. The similarity between points is based on the squared Euclidean distance between two points in m-dimensional space, where m is the number of logs used.
3. Each centroid is then relocated to the centre of the samples that were assigned to it.
4. Steps 2 and 3 are repeated until the cluster assignments do not change, or until a user-defined tolerance or maximum number of iterations is reached.
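The four steps can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this report: the function names `kmeans` and `squared_distance`, the `seed` and `max_iter` parameters, and the six sample points (standing in for normalised values of two logs) are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points in m-dimensional space."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(samples, k, max_iter=100, seed=0):
    """Illustrative k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Step 1: randomly pick k of the sample points as initial centroids.
    centroids = [list(s) for s in rng.sample(samples, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(s, centroids[j]))
            for s in samples
        ]
        # Step 4: stop once the cluster assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned samples.
        for j in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical samples: two well-separated groups in two dimensions.
samples = [
    (0.10, 0.20), (0.15, 0.25), (0.12, 0.18),
    (0.80, 0.90), (0.85, 0.88), (0.82, 0.95),
]
labels, centroids = kmeans(samples, k=2)
print(labels)
```

Production libraries add refinements that this sketch omits, such as repeated random restarts and smarter seeding of the initial centroids.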

A possible drawback of k-means is that the number of clusters (or lithofacies) must be specified before running the analysis. While the number of clusters may be obvious in datasets of only two log types, or where the formations comprise only a few easily distinguishable rock types, it may be less obvious when a larger number of logs are brought into the analysis. In such cases an ‘elbow plot’, showing the sum of squared distances from each sample to the centre of its cluster for a range of cluster numbers, can be useful (Figure 25). The optimal number of clusters is generally thought to be at the point of maximum curvature. In poorly known rock formations the elbow plot may give a useful initial indication of how many lithologies are present (or resolvable using the available logs) that is independent of any preconceived (and possibly erroneous) ideas of the geologist.
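As a sketch of how the values behind an elbow plot might be computed, the example below fits scikit-learn's `KMeans` for a range of cluster numbers and records the `inertia_` attribute, which is the sum of squared distances of samples to their closest cluster centre. The synthetic three-group dataset is an assumption, and picking the elbow from the largest second difference is only a crude stand-in for visual inspection of the plot, not the procedure used in this report.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for two normalised logs: three well-separated groups.
rng = np.random.default_rng(0)
centres = np.array([[0.2, 0.2], [0.5, 0.8], [0.8, 0.3]])
X = np.vstack([c + 0.03 * rng.standard_normal((100, 2)) for c in centres])

# Sum of squared distances to the nearest centroid for k = 1..8.
ks = list(range(1, 9))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

# Crude elbow estimate (an assumption): the k at which the curve bends most
# sharply, taken as the largest second difference of the inertia values.
second_diff = np.diff(inertias, n=2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]

for k, inertia in zip(ks, inertias):
    print(k, round(float(inertia), 3))
print("estimated elbow at k =", elbow_k)
```

Plotting `ks` against `inertias` with any plotting library gives the elbow plot itself. On real log data the bend is often much less sharp than in this synthetic case, so the estimate is best treated as a starting point.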

Figure 25. K-means cluster analysis of normalised gamma-ray and sonic log data for the Mercia Mudstone Group of the Winterborne Kingston borehole. Seven centroids have been pre-selected corresponding to the inflection point on the elbow plot.

5.2 PRACTICAL IMPLEMENTATION OF K-MEANS CLUSTERING

K-means cluster analysis can be undertaken easily and rapidly on large datasets using open-source Python tools such as those provided by scikit-learn (https://scikit-learn.org/stable/index.html). The geoapps project (https://pypi.org/project/geoapps/) created by Mira Geoscience has been used here. It wraps the scikit-learn k-means clustering algorithm in a Jupyter-Notebook application with a range of Plotly visualisation tools for assessing the results using histogram, box, scatter, inertia and cross-correlation plots (Figure 26).
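For readers who want to call scikit-learn directly rather than through geoapps, a minimal sketch might look like the following. The log values are random placeholders, not data from the Winterborne Kingston borehole. The min-max normalisation is an assumption, since the normalisation method is not specified here, and the choice of seven clusters simply mirrors Figure 25.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Placeholder log curves (hypothetical values and units).
rng = np.random.default_rng(42)
n_samples = 500
gamma = rng.uniform(20.0, 150.0, n_samples)  # gamma-ray, API units
sonic = rng.uniform(50.0, 120.0, n_samples)  # sonic transit time, us/ft
logs = np.column_stack([gamma, sonic])       # shape (n_samples, 2)

# Rescale each log to the range 0-1 so that neither dominates the distance
# calculation, then cluster into seven groups.
model = make_pipeline(
    MinMaxScaler(),
    KMeans(n_clusters=7, n_init=10, random_state=0),
)
labels = model.fit_predict(logs)

# One integer cluster label (0-6) per depth sample.
print(np.bincount(labels))
```

Normalising before clustering matters because the squared Euclidean distance would otherwise be dominated by whichever log has the largest numerical range.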

Figure 26. K-means clustering as implemented in Mira Geoscience geoapps Jupyter-Notebook. Inbuilt Plotly visualisation tools provide a highly interactive environment. 2D and 3D cross-plots show normalised gamma-ray, sonic and density curve data for the Mercia Mudstone Group of the Winterborne Kingston Borehole.

Results can be saved and displayed within the free Mira Geoscience ANALYST 3D viewer or exported as text files to use in other log handling applications.

5.3 EXAMPLE OF RESULTS

Figure 27 shows the results of a k-means cluster analysis for the Mercia Mudstone Group of the Winterborne Kingston borehole and highlights some of the similarities and differences with the previous cut-off analysis. While unsupervised cluster analysis of this type is unlikely to provide a definitive lithological classification of a borehole based on log data, it is nonetheless a rapid and powerful method for gaining insight into the dataset and could guide additional supervised work or manual intervention and adjustment.

Figure 27. Results of k-means cluster analysis (right-hand track) performed in geoapps and imported to SKUA-GOCAD. Note both similarities and differences between the cut-off analysis and cluster analysis.