GeoConvention 2018
Data Science Tools for Petroleum Exploration and Production

Preamble

This tutorial is meant to accompany a talk given by Matteo Niccoli and myself (Thomas Speidel) at the 2018 GeoConvention in Calgary, Canada, titled Data Science Tools for Petroleum Exploration and Production. Many of the ideas were inspired by Niccoli’s past work. See, for instance, Machine learning in geoscience with scikit-learn - notebook 1.

This tutorial was created using the open-source language R. However, most of the methods illustrated can also be done in Python. See Niccoli’s implementation of some of these methods in the Python environment, for instance the implementation of distance correlation in Python.



Description of the Example Dataset

We will illustrate a number of methodologies on the data set from Hunt1. Niccoli2 illustrated a number of statistical techniques on this data set in a series of notebooks. As stated there: “the target variable to be predicted, Y, is oil production (measured in tens of barrels of oil per day) from a marine barrier sand”. Below is a sample of the data:

1 Hunt, L. (2013), Many correlation coefficients, null hypotheses, and high value. CSEG Recorder, 38 (10)
2 Niccoli, M (2017), Machine learning in geoscience with scikit-learn - notebook 2



Is There A Difference in Production?

Suppose we are interested in assessing whether production changes according to position. The conventional approach is to perform an ANOVA or a t-test. However, both rely on normality and are sensitive to outliers. When comparing two samples, we can utilize the Wilcoxon test, while for more than two samples the Kruskal-Wallis test, a generalization of the Wilcoxon test, can be used.

First, let’s assess the distribution graphically. From the boxplot below, it is not apparent whether there is a difference in production between positions:


Next, we perform the Wilcoxon test:

estimate  statistic  p.value  conf.low  conf.high  method                  alternative
14.98     83         0.041    4.64      27.3       Wilcoxon rank sum test  two.sided

A P-value of 0.041 tells us that, assuming there is no true difference (the null hypothesis), the probability of observing a difference in distribution equal to or more extreme than the one observed is 4.1%. In other words, it’s somewhat unlikely, but not impossible, that the difference in production is due to chance variation.
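For reference, output like the table above can be produced along these lines (a minimal sketch, assuming the data frame is named hunt and position.cat has two levels; broom::tidy() gives the tabular summary):

library(broom)

# Two-sample Wilcoxon (Mann-Whitney) test of production by position,
# with a confidence interval for the location shift
wt <- wilcox.test(production ~ position.cat, data = hunt, conf.int = TRUE)
tidy(wt)   # estimate, statistic, p.value, conf.low, conf.high, ...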


Distance Correlation

Measures of correlation quantify the strength of the relationship between pairs of variables. The traditional Pearson correlation has several limitations, one of which is that it assumes a linear relationship. The Spearman correlation is generally preferred and avoids some of these limitations. A relatively new and powerful measure of correlation is the distance correlation.

Above is the distance correlation matrix. Colored boxes represent statistically significant correlations at the 10% significance level. The correlation measure is able to detect that gross.pay is highly correlated with gross.pay.transform (they are algebraically related). Also, gross.pay is highly correlated with production. Notice that, unlike other correlation measures, distance correlation does not indicate direction (positive or negative correlation).
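In R, distance correlations such as those in the matrix can be computed pairwise with the energy package (a minimal sketch; the exact functions used to build the matrix and assess significance are assumptions):

library(energy)

# Distance correlation between production and gross.pay
dcor(hunt$production, hunt$gross.pay)

# Permutation test of independence based on distance covariance;
# its p-value can be compared against the 10% significance level
dcov.test(hunt$production, hunt$gross.pay, R = 999)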


A Note on the Choice of Correlation

Among measures of correlation capable of detecting non-linear association, MIC (maximal information coefficient, Reshef et al.1) has been touted as a powerful measure. However, several authors (Kinney2, Simon3) point to a number of problems with MIC and suggest utilizing distance correlation instead:

“We believe that the recently proposed distance correlation measure of Székely & Rizzo (2009) is a more powerful technique that is simple, easy to compute and should be considered for general use”.

1 Reshef et al. (2011), Detecting novel associations in large data sets
2 Kinney et al. (2013), Equitability, mutual information, and the maximal information coefficient
3 Simon et al. (2014), Comment on "Detecting Novel Associations in Large Data Sets"



Smoothers

Correlation matrices are useful to visualize combinations of pairwise correlations. There is, however, more information to be had by plotting the data and adding a trend line. One of the most flexible smoothers is Cleveland’s LOESS (locally weighted scatterplot smoothing). Let’s plot the LOESS curves for each variable against production, separately for each position.

library(ggplot2)    # plotting
library(Hmisc)      # stat_plsmo() nonparametric smoother
library(patchwork)  # combine the panels with `+`

g1 <- ggplot(hunt, aes(y = production, x = gross.pay, colour = factor(position.cat))) +
  stat_plsmo(fullrange = TRUE) + 
  geom_point() +
  xlab("Gross Pay") +
  ylab("Production") +
  theme_minimal(base_family = "Raleway", base_size = 11) +
  scale_color_discrete(guide = guide_legend(title = "Position")) +
  theme(legend.position="none") +
  theme(panel.background = element_rect(fill = "#FCFCFC", colour = NA),
        plot.background = element_rect(fill = "#FCFCFC", colour = NA))

g2 <- ggplot(hunt, aes(y = production, x = phi.h, colour = factor(position.cat))) +
  stat_plsmo(fullrange = TRUE) + 
  geom_point() +
  xlab("Phi-h") +
  ylab("Production") +
  theme_minimal(base_family = "Raleway", base_size = 11) +
  scale_color_discrete(guide = guide_legend(title = "Position")) +
  theme(legend.position="none") +
  theme(panel.background = element_rect(fill = "#FCFCFC", colour = NA),
        plot.background = element_rect(fill = "#FCFCFC", colour = NA))


g3 <- ggplot(hunt, aes(y = production, x = pressure, colour = factor(position.cat))) +
  stat_plsmo(fullrange = TRUE) + 
  geom_point() +
  xlab("Pressure") +
  ylab("Production") +
  theme_minimal(base_family = "Raleway", base_size = 11) +
  scale_color_discrete(guide = guide_legend(title = "Position")) +
  theme(legend.position="none") +
  theme(panel.background = element_rect(fill = "#FCFCFC", colour = NA),
        plot.background = element_rect(fill = "#FCFCFC", colour = NA))

g4 <- ggplot(hunt, aes(y = production, x = random.1, colour = factor(position.cat))) +
  stat_plsmo(fullrange = TRUE) + 
  geom_point() +
  xlab("Random Var") +
  ylab("Production") +
  theme_minimal(base_family = "Raleway", base_size = 11) +
  scale_color_discrete(guide = guide_legend(title = "Position")) +
  theme(legend.position="none") +
  theme(panel.background = element_rect(fill = "#FCFCFC", colour = NA),
        plot.background = element_rect(fill = "#FCFCFC", colour = NA))

g5 <- ggplot(hunt, aes(y = production, x = gross.pay.transform, colour = factor(position.cat))) +
  stat_plsmo(fullrange = TRUE) + 
  geom_point() +
  xlab("Gross Pay (transformed)") +
  ylab("Production") +
  theme_minimal(base_family = "Raleway", base_size = 11) +
  scale_color_discrete(guide = guide_legend(title = "Position")) +
  theme(legend.position="none") +
  theme(panel.background = element_rect(fill = "#FCFCFC", colour = NA),
        plot.background = element_rect(fill = "#FCFCFC", colour = NA))


g1+g2+g3+g4+g5

Scatter plots of production against each variable, by position (colour denotes position). A LOESS curve was added to highlight the overall trend.



Variable Selection

Which variables should we focus on in understanding what drives changes in production? Which variables are measuring the same or similar underlying quantities?

A statistically principled approach is to start by eliminating variables based on domain knowledge. This may be assisted by statistical data reduction methods such as clustering and redundancy analysis.


Clustering

Here we assess clusters of independent variables. We do so blinded to the response variable, production, in order to avoid introducing bias. In cluster analysis, the objective is to identify variables that are measuring the same underlying phenomenon. If such variables exist, and domain knowledge supports it, one could argue for removing one of them.

Hierarchical cluster dendrogram of all variables except production. The similarity matrix is based on the Hoeffding D statistic, which detects non-monotonic associations.
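A dendrogram of this kind can be produced with Hmisc’s varclus(), which supports a Hoeffding D similarity (a minimal sketch; the variable list is assumed from the data set described earlier):

library(Hmisc)

# Cluster the candidate predictors (response excluded) using Hoeffding's D
vc <- varclus(~ gross.pay + phi.h + pressure + random.1 + random.2 +
                gross.pay.transform + position.cat,
              data = hunt, similarity = "hoeffding")
plot(vc)   # hierarchical cluster dendrogram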

There are numerous methods to perform hierarchical clustering. It’s good practice to try different methods to see if the results are in the same ballpark. Let’s do that with Hierarchical Clustering with P-Values via Multiscale Bootstrap1:

## Creating a temporary cluster...done:
## socket cluster with 7 nodes on host 'localhost'
## Multiscale bootstrap... Done.
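The multiscale bootstrap clustering is implemented in the pvclust package, which produced the messages above when run in parallel (a minimal sketch; the distance measure, linkage method and number of bootstrap replicates are assumptions):

library(pvclust)

# pvclust clusters the columns of its input; bootstrap p-values (AU/BP)
# indicate how strongly the data support each cluster
pv <- pvclust(scale(hunt[, c("gross.pay", "phi.h", "pressure", "random.1",
                             "random.2", "gross.pay.transform")]),
              method.dist = "correlation", method.hclust = "ward.D2",
              nboot = 1000, parallel = TRUE)
plot(pv)
pvrect(pv, alpha = 0.95)   # highlight clusters strongly supported by the data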

From either cluster dendrogram we can see how gross.pay and gross.pay.transform cluster together and, to a lesser extent, phi.h. Based on these results, the natural choice is to remove either gross.pay or gross.pay.transform from further analyses. Suppose we don’t know that gross.pay.transform is algebraically related to gross.pay: which of the two should we exclude? Redundancy analysis helps us determine that.

1 Suzuki et al. (2006), Pvclust: an R package for assessing the uncertainty in hierarchical clustering


Redundancy Analysis

In redundancy analysis we look at which variables can be predicted with high confidence from any combination of the other variables. Those variables can then be safely omitted from further analysis. This approach is more robust than the pairwise correlation measures from before. We’ve set the R2 cutoff at 0.7.
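Redundancy analysis is implemented in the Hmisc package as redun() (a minimal sketch; the set of candidate predictors mirrors the variable list below, and r2 sets the R2 cutoff):

library(Hmisc)

# Flag variables that can be predicted from the remaining ones with R^2 >= 0.7
red <- redun(~ gross.pay + phi.h + I(position.cat) + pressure +
               random.1 + random.2 + gross.pay.transform,
             data = hunt, r2 = 0.7)
red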

We started with:

##  [1] "gross.pay'"          "phi.h'"              "I(position.cat)"    
##  [4] "pressure"            "pressure'"           "random.1"           
##  [7] "random.1'"           "random.2"            "random.2'"          
## [10] "gross.pay.transform"

The model suggests that:

##            gross.pay gross.pay.transform'                phi.h 
##            0.9731901            0.8916161            0.7507618

can be removed because they are predicted from all the remaining variables. The numbers represent the R2 with which a variable can be predicted from all other remaining ones. However, we need to be very cautious, since the sample size is much too low to reliably suggest which variables can be omitted.


LASSO

The previous methods achieved variable reduction without consideration of the response variable, production. This is a sound approach; however, we may wish to do variable selection within a regression model. The LASSO (least absolute shrinkage and selection operator) achieves that by shrinking the model coefficients. The primary purpose of shrinkage methods is to improve predictive accuracy. However, when coefficients are shrunk all the way to zero, the LASSO effectively performs variable selection.
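A fit along these lines might be obtained with the glmnet package (a minimal sketch; the use of cross-validation to pick \(\lambda\) matches the description further below, but the exact settings are assumptions):

library(glmnet)

# Model matrix of predictors (glmnet needs a numeric matrix) and response
x <- model.matrix(production ~ gross.pay + phi.h + pressure + random.1 +
                    random.2 + gross.pay.transform + position.cat,
                  data = hunt)[, -1]
y <- hunt$production

# Cross-validated LASSO (alpha = 1); lambda.min minimizes the CV error
set.seed(123)
cvfit <- cv.glmnet(x, y, alpha = 1)
coef(cvfit, s = "lambda.min")                 # shrunk coefficients
plot(cvfit$glmnet.fit, xvar = "lambda", label = TRUE)  # coefficient paths
plot(cvfit)                                   # CV error as a function of lambda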

Fitting a LASSO model to the data yields the following shrunk coefficients:

Variable              Shrunk Coefficient
(Intercept)           -4.278
gross.pay              1.032
phi.h                  0.116
pressure               1.332
random.1               NA
random.2               0.010
gross.pay.transform    NA
position.cat          -6.544

Looking at the table of coefficients, notice how random.1 and gross.pay.transform have no coefficient. That’s because they have been shrunk to zero, thus achieving variable selection. Our model suggests that these two variables are not useful in explaining changes in production. The coefficient for random.2 is also nearly zero, so we could remove it as well.

To gain more insight, it helps to visualize how fast coefficients are shrunk to zero (thus, excluded from the model) as we increase the amount of shrinkage, \(\lambda\).


On the far left, where \(log(\lambda)\) (x-axis) is large, all coefficients (y-axis) are shrunk to zero, meaning all variables have been excluded. Moving to the right, as \(log(\lambda)\) decreases, gross.pay enters the model, followed by phi.h. On the far right, where \(log(\lambda)\) is smallest (i.e. \(\lambda\) approaches 0), little shrinkage occurs and no variable selection takes place.

Throughout the graph, random.1 and random.2 remain consistently close to the zero line, suggesting that, no matter the value of \(\lambda\), little weight is ever given to either. The vertical line at 0.052 is the value of \(log(\lambda)\) chosen by cross-validation that minimizes the loss function (this is what was used to generate the table above).

For this model the loss function is the mean squared error (MSE). The value of \(\lambda\) minimizing the loss function is depicted by the red line and chosen via cross-validation.

It is helpful to visualize how MSE changes as a function of \(\lambda\), but more crucially, to visualize the error bars of each MSE. This graph should give us pause in picking a \(\lambda\) value, since the MSE for the majority of values below 0.052 is nearly equivalent.


Recursive feature elimination

There is no shortage of methods on variable selection coming from the Machine Learning literature. Recursive feature elimination1 is a model-based method. The technical details are beyond the scope of this article.

The majority of model-based variable selection methods perform better with normalized data (subtract the mean and divide by the standard deviation).
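Results like those in the table below might be produced with caret’s rfe() function (a minimal sketch; the linear-model helper functions, 10-fold cross-validation, and the assumption that all predictors are numeric are choices made for illustration):

library(caret)

set.seed(123)
preds <- hunt[, c("gross.pay", "phi.h", "pressure", "random.1",
                  "random.2", "gross.pay.transform", "position.cat")]

# Normalize the predictors, then recursively eliminate features using a
# linear model assessed by 10-fold cross-validation
rfe_fit <- rfe(x = scale(preds), y = hunt$production,
               sizes = 1:7,
               rfeControl = rfeControl(functions = lmFuncs,
                                       method = "cv", number = 10))
rfe_fit              # RMSE, R squared, MAE by number of variables
predictors(rfe_fit)  # variables in the selected subset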

Variables RMSE Rsquared MAE
1 9.313 0.614 7.588
2 8.247 0.732 6.697
3 6.613 0.828 5.464
4 3.770 0.937 3.015
5 4.078 0.925 3.293
6 4.049 0.926 3.184
7 4.012 0.928 3.091

From the table we can see that a model with 4 variables achieves the lowest RMSE and the highest R2. Their standard deviation is also among the lowest. Those 4 variables are:

## [1] "hunt.phi.h"     "hunt.pressure"  "hunt.gross.pay" "position.cat"

We can, of course, plot the results for more immediacy:

1 Guyon et al., Gene Selection for Cancer Classification using Support Vector Machines


Sparse Principal Components

Principal components have a long and rich history in statistics and several applied sciences, having been first proposed by Harold Hotelling in the 1930’s. Principal components allow us to describe our data with fewer new variables, called principal components, each a linear combination of the original variables. These components are derived in such a way as to capture as much of the variation in the data as possible. For a wonderful explanation of principal components see this CrossValidated discussion.

Sparse principal components use ideas similar to those of the LASSO, in the sense that instead of utilizing all variables, only some variables contribute to each component (hence the term sparse). Therefore, as with the LASSO, some loadings are shrunk to zero. Let’s take a look at the loadings, that is, the weights of the linear combinations1:

Comp.1 Comp.2 Comp.3 Comp.4
gross.pay 0.5990 0.0000 0.0000 0.0000
phi.h 0.5450 -0.0814 0.0000 0.0000
pressure 0.0000 -0.3480 0.0000 0.8804
random.1 0.0000 0.2194 0.9449 0.0000
random.2 0.0000 -0.6331 0.3275 0.0000
gross.pay.transform 0.5867 0.0756 0.0000 0.0000
position.cat 0.0000 0.6462 0.0000 0.4742
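Sparse loadings of this kind can be computed, for example, with spca() from the elasticnet package (a minimal sketch; the package choice, the scaling, and the number of non-zero loadings per component are assumptions made for illustration):

library(elasticnet)

x <- scale(hunt[, c("gross.pay", "phi.h", "pressure", "random.1",
                    "random.2", "gross.pay.transform", "position.cat")])

# Sparse PCA: 4 components, constraining the number of non-zero loadings
# per component via sparse = "varnum" (illustrative values)
sp <- spca(x, K = 4, type = "predictor", sparse = "varnum",
           para = c(3, 3, 2, 2))
round(sp$loadings, 4)   # sparse loadings matrix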

Notice how, in the loadings matrix, some weights were shrunk to zero. gross.pay is assigned the greatest weight in PC1, while position.cat is assigned the greatest loading in PC2. Next, we are going to visualize the PCs:


The plot above is called a biplot and it is filled with information:

  • Principal component 1 accounts for 45% of the variation
  • Principal component 2 accounts for 23% of the variation
  • Together, the two principal components account for 68% of the variation
  • Points that are close to one another have similar values on the variables: for instance, rows 3, 5 and 7 are similar in their gross.pay, gross.pay.transform and phi.h profile, but very different compared to row 15.
  • The correlation between the three vectors gross.pay.transform, gross.pay and phi.h is high, as represented by the small angle between them.
  • The correlation between these three vectors is positive because they all point in similar directions.
  • Take point 18 (far right): we would expect this observation to have among the highest values for gross.pay.transform, gross.pay and phi.h.
  • Production is colour coded, and there seems to be a cluster of low production associated with low gross.pay.transform, gross.pay and phi.h.

One of the limitations of principal components analysis is that the new PCs are not directly interpretable.


1 We will focus on the first two principal components.


Summary of Variable Selection

There is little consensus among researchers on what constitutes a sensible variable selection strategy. In fact, many authors are critical of several variable selection techniques and suggest utilizing domain knowledge coupled with variable selection blinded to the response1.

Our example is hardly demonstrative, since we are dealing with a very limited sample size. Nonetheless, we can summarize the results as follows:

  • Domain knowledge: 4 selected; removed: random.1, random.2, gross.pay.transform. Notes: baseline.
  • Clustering: NA; gross.pay and gross.pay.transform can be represented by a single variable. Notes: only suggestive, not a true variable selection technique.
  • Redundancy analysis: 5 selected; removed: gross.pay, gross.pay.transform.
  • LASSO: 5 selected; removed: random.1, gross.pay.transform. Notes: random.2 was very close to zero.
  • Recursive feature elimination: 4 selected; removed: random.1, random.2, gross.pay.transform.
  • Exhaustive search: 4 selected; removed: random.1, random.2, gross.pay.transform. Notes: based on visual inspection.
  • Sparse principal components: 2 selected; removed: random.1, position.cat. Notes: not a true variable selection technique.

From these limited data, recursive feature elimination performed best. There are many variable selection strategies and we have only skimmed through a few of them.

Little consensus exists around variable selection. Harrell 1,2 states that

Variable selection (without penalization) only makes things worse. Variable selection has almost no chance of finding the “right” variables, and results in large overstatements of effects of remaining variables and huge understatement of standard errors.

Kuhn3 confirms by writing:

This underscores the idea that feature selection is a poor method for determining the most significant variables in the data, as opposed to which predictors mostly influenced the model. Different models will score the predictors differently. To assess which predictors have individual associations with the outcome, common classical statistical methods are far more appropriate.

Hence, using domain knowledge before applying any statistical methods is preferable. If statistical or ML approaches are to be used, it is best to start with methods blinded to the outcome. Shrinkage methods, such as the LASSO, can be used as a next step.

1 See Harrell (2015), Regression Modeling Strategies
2 See Harrell (2015), CrossValidated
3 See Kuhn et al. (2013), Applied Predictive Modeling



Sample Size and Power Calculations

Sample size calculations are customary tools used to estimate how many observations one needs in order to detect a given effect with a certain probability. Effect here means, quite broadly, any difference between, say, two means (or other parameters). For instance, one may be interested in the difference between mean production in the presence or absence of certain geological features, and wonder how many wells would be needed in order to conclude with a given confidence that the two means differ.

Key points and definitions in the evaluation of sample sizes are:

  1. Large differences are easier to detect than smaller differences
  2. We may commit two types of errors, called type I (alpha, or false positive) and type II error (beta, or false negative)
  3. Power refers to 1-beta, or the probability of detecting a difference if in fact it exists
  4. Normally, we solve for n (sample size) or power

Example
Suppose we are trying to assess whether a hypothetical geological feature is associated with increased production. How many wells are needed to achieve an 80% probability of detecting a difference in mean production (1-beta, or power), if we are willing to accept a 10% probability of declaring a false positive (alpha, or significance level)?

In the simplest scenario one could utilize the following formula:

\[\normalsize n = \frac{(Z_{\alpha /2}+Z_{\beta})^{2}\times 2\sigma ^2}{d^{2}}\]

where \(Z_{\alpha /2}\) is the critical value from the normal distribution at α/2 (for a two-sided test); \(Z_{\beta}\) is the critical value from the normal distribution at β; \(\sigma ^2\) is the population variance (or pooled variance); \(d\) is the difference between the two means.

From the equation we need an estimate for \(d = \overline{x_{1}}-\overline{x_{2}}\) and the population standard deviation, \(\sigma\). Normally, we take these estimates from similar analyses or data sets. Suppose from a previous analysis we know that:

Mean production when geological feature is present: 48 (SD: 15.6)
Mean production when geological feature is not present: 27 (SD: 11.6)
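Plugging these estimates into the formula above gives a rough per-group sample size (a minimal sketch in R; it uses a two-sided alpha of 0.10, 80% power, and pools the two standard deviations):

alpha <- 0.10   # significance level (two-sided)
power <- 0.80   # 1 - beta
d     <- 48 - 27                        # difference in means
sigma <- sqrt((15.6^2 + 11.6^2) / 2)    # pooled standard deviation

# n per group from the normal-approximation formula above
n <- (qnorm(1 - alpha / 2) + qnorm(power))^2 * 2 * sigma^2 / d^2
ceiling(n)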

It is always wise to solve for a range of plausible values, rather than a single point. In the graph below, we have included a range of values: 34 to 44 for \(\mu_1\) and 24 to 34 for \(\mu_2\) 1.

Solving the above equation yields the following curves2,3.

## This uses output from NCSS PASS
ss1 <- structure(list(Power = c(0.807, 0.805, 0.803, 0.802, 0.8, 0.8, 
0.801, 0.807, 0.805, 0.803, 0.802, 0.8, 0.816, 0.801, 0.807, 
0.805, 0.803, 0.802, 0.819, 0.816, 0.801, 0.807, 0.805, 0.803, 
0.824, 0.819, 0.816, 0.801, 0.807, 0.805, 0.849, 0.824, 0.819, 
0.816, 0.801, 0.807), N1 = c(26L, 40L, 70L, 156L, 619L, NA, 18L, 
26L, 40L, 70L, 156L, 619L, 14L, 18L, 26L, 40L, 70L, 156L, 11L, 
14L, 18L, 26L, 40L, 70L, 9L, 11L, 14L, 18L, 26L, 40L, 8L, 9L, 
11L, 14L, 18L, 26L), N2 = c(26, 40, 70, 156, 619, 1.7e+308, 18, 
26, 40, 70, 156, 619, 14, 18, 26, 40, 70, 156, 11, 14, 18, 26, 
40, 70, 9, 11, 14, 18, 26, 40, 8, 9, 11, 14, 18, 26), µ1 = c(34, 
34, 34, 34, 34, 34, 36, 36, 36, 36, 36, 36, 38, 38, 38, 38, 38, 
38, 40, 40, 40, 40, 40, 40, 42, 42, 42, 42, 42, 42, 44, 44, 44, 
44, 44, 44), µ2 = c(24, 26, 28, 30, 32, 34, 24, 26, 28, 30, 32, 
34, 24, 26, 28, 30, 32, 34, 24, 26, 28, 30, 32, 34, 24, 26, 28, 
30, 32, 34, 24, 26, 28, 30, 32, 34), µ1...µ2 = c(10, 8, 6, 4, 
2, 0, 12, 10, 8, 6, 4, 2, 14, 12, 10, 8, 6, 4, 16, 14, 12, 10, 
8, 6, 18, 16, 14, 12, 10, 8, 20, 18, 16, 14, 12, 10), s1 = c(16, 
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 
16, 16, 16), s2 = c(12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 
12, 12, 12, 12, 12, 12, 12, 12, 12), Alpha = c(0.1, 0.1, 0.1, 
0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 
0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 
0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1), Beta = c(0.193, 0.195, 0.197, 
0.198, 0.2, 0.2, 0.199, 0.193, 0.195, 0.197, 0.198, 0.2, 0.184, 
0.199, 0.193, 0.195, 0.197, 0.198, 0.181, 0.184, 0.199, 0.193, 
0.195, 0.197, 0.176, 0.181, 0.184, 0.199, 0.193, 0.195, 0.151, 
0.176, 0.181, 0.184, 0.199, 0.193)), .Names = c("Power", "N1", 
"N2", "µ1", "µ2", "µ1...µ2", "s1", "s2", "Alpha", "Beta"), class = "data.frame", row.names = c(NA, 
-36L))

## Or could have used R
# pwr::pwr.t.test(d=(0-18)/14.14214,
#                 power = 0.8,
#                 sig.level = 0.10,
#                 type = "two.sample",
#                 alternative = "two.sided")


library(dplyr)  # mutate() and the pipe operator

ss1 <- ss1 %>%
  mutate(µ2 = as.factor(µ2))

plotly::ggplotly(ggplot(ss1, aes(x = µ1, y = N1, color = µ2)) +
  geom_point(size = 3) + 
  geom_line(size = 1) +
  scale_colour_viridis_d() +
  scale_y_log10(breaks = c(8, 9, 10, 12, 14, 16, 20, 24, 28, 34, 44, 50, 60, 70, 80, 100, 150, 250, 400, 600)) +
annotation_logticks(sides = "lr") +
  labs(y = "N1",
       color = "µ2",
       caption = "N2 assumed to be equal to N1") +
  theme_minimal(base_family = "Raleway", base_size = 14) +
  theme(panel.background = element_rect(fill = "#FCFCFC", colour = NA),
        plot.background = element_rect(fill = "#FCFCFC", colour = NA))
)


Therefore, if we believe that the true mean production with and without the geological feature is, say, 42 and 26 (\(d=16\)), respectively, we would need at least 11 wells in each group (22 in total) in order to have an 80% probability of detecting a difference in production as small as 16 (see notes 4 and 6). We are willing to accept a 10% chance of concluding there is a significant difference in production when, in fact, there is none.

The statistical field on sample size and power calculations is very rich. Here, we have taken numerous liberties in order to illustrate the general approach. Readers interested in this area are advised to work with statisticians5 and to utilize sample size and power calculation tools or libraries.


1 Readers may wonder why the range of values does not match the means provided. We constructed the range by examining bootstrap replicates of mean production in order to have a better sense of how much the mean could vary.
2 The choice of solving for n or power depends on the objective. For power, it is tempting to perform the calculations after collecting the data and utilize the observed parameters (post-hoc). This is considered poor statistical practice.
3 We utilized a different formula than the one shown in order to account for different variances.
4 In this test we chose what’s called a two-sided test: we are entertaining the possibility that one mean could be higher or lower than the other mean. A one-sided test is possible, but should only be performed if we have strong evidence that one mean is higher than the other.
5 The number of statistical procedures for sample size and power calculations can be overwhelming. For instance, PASS, a software tool for sample size and power calculations, lists as many as 740 statistical tests.
6 We have chosen to have equal sample size allocation. However, one could choose different sample sizes for n1 and n2. This is useful when, for instance, the cost of sampling is higher in one group than the other.



Regression

In this section we will fit a regression model to better understand changes in production. We are going to utilize the previous findings of the variable selection by entertaining a model with gross.pay, phi.h, pressure and position.cat. Notice that the sample size is much too small to fit a model with all these variables. A general rule of thumb suggests we need 15-20 observations per variable; therefore, we would need about 80 observations (wells in this case), far more than the 21 we have.

Recall from the correlation matrix that gross.pay and phi.h are highly correlated. This is going to be a problem when we interpret the model because of collinearity. Collinearity makes it difficult to estimate and interpret a particular regression coefficient because the data have little information about the effect of changing one variable while holding another (highly correlated) variable constant (Harrell, 2015).

So, which variable should we keep in the model, gross.pay or phi.h?
A sensible strategy is to let domain knowledge about the physics guide which one should be in the model. One also needs to consider costs: if collecting one variable will be significantly more expensive in the future, we might want to stop considering it further. However, suppose we have no preference. Let’s revisit redundancy analysis, but this time focus only on a subset of variables:

## 
## Redundancy Analysis
## 
## redun(formula = ~gross.pay + phi.h + I(position.cat) + pressure, 
##     data = hunt, r2 = 0.7)
## 
## n: 21    p: 4    nk: 3 
## 
## Number of NAs:    0 
## 
## Transformation of target variables forced to be linear
## 
## R-squared cutoff: 0.7    Type: ordinary 
## 
## R^2 with which each variable can be predicted from all other variables:
## 
##       gross.pay           phi.h I(position.cat)        pressure 
##           0.838           0.802           0.301           0.037 
## 
## Rendundant variables:
## 
## gross.pay
## 
## Predicted from variables:
## 
## phi.h I(position.cat) pressure 
## 
##   Variable Deleted   R^2 R^2 after later deletions
## 1        gross.pay 0.838

What does this tell us? That it’s marginally easier to predict gross.pay from all other variables (R2 = 0.838) than it is to predict phi.h. Therefore, we are better off removing gross.pay and keeping phi.h instead.


Non Linearities

It is generally wise not to assume a system, especially in nature, behaves linearly. The major limitation here, however, is sample size: entertaining non-linearities costs additional degrees of freedom. Thus, it is important to be very selective. A general technique is to entertain non-linearities only for those variables that have enough predictive power.

Once again, recall the correlation matrix. Which variable among phi.h, position.cat and pressure has the highest correlation with the response, production? That would be phi.h (distance correlation = 0.88). Therefore, we will assign more degrees of freedom to phi.h, on the grounds that missing a non-linear relationship here would be costlier.

How exactly are we going to make a variable non-linear? Several approaches exist. One flexible approach is based on restricted cubic splines. The details are beyond the scope of this tutorial. Interested readers can refer to Davis1, who articulates the intuition and provides a fascinating historical perspective.

1 Davis, P. (1996), B-Splines and Geometric Design.


Linear Model

Armed with this information, it’s time to fit a linear model. But before we do that, we need to decide on the kind of model. The size of the data is much too small to consider models from the Machine Learning field1.

We could try least squares regression, but our response, production, is a rate and we could run into problems. Beta, Poisson, or negative binomial regression may provide better approaches. Here we will illustrate a robust approach, ordinal regression, a type of semi-parametric model.

1 Van der Ploeg et al. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints

Probit Ordinal Regression Model
 orm(formula = production ~ rcs(phi.h, 3) + position.cat + pressure, 
     data = hunt, x = TRUE, y = TRUE, family = "probit")
 
                        Model Likelihood     Discrimination              Rank Discrim.
                        Ratio Test           Indexes                     Indexes
Obs                21   LR χ2       54.82    R2                  0.929   ρ   0.969
Unique Y           21   d.f.            4    g                   4.831
Y0.5            36.42   Pr(>χ2)   <0.0001    gr                125.315
max |∂log L/∂β| 0.005   Score χ2    34.25    |Pr(Y ≥ median)-½|   0.430
                        Pr(>χ2)   <0.0001

                     β        S.E.     Wald Z   Pr(>|Z|)
phi.h                0.1185   0.0253    4.68    <0.0001
phi.h’              -0.0515   0.0171   -3.02     0.0025
position.cat=[2,3]  -2.7903   0.7852   -3.55     0.0004
pressure             0.4958   0.1246    3.98    <0.0001
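A fit like the one summarized above can be obtained with the rms package (a minimal sketch; the datadist setup and the use of Mean() to create the bar function referenced in the heatmap code below are assumptions):

library(rms)

dd <- datadist(hunt); options(datadist = "dd")

# Probit ordinal regression with a 3-knot restricted cubic spline for phi.h
f1 <- orm(production ~ rcs(phi.h, 3) + position.cat + pressure,
          data = hunt, x = TRUE, y = TRUE, family = "probit")
f1

# Function mapping the linear predictor to a predicted mean;
# assumed to be the `fun = bar` argument used in Predict() below
bar <- Mean(f1)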

## Heatmap of predicted mean production
library(viridis)  # for scale_fill_viridis()

heatmap <- Predict(f1, phi.h, pressure, position.cat, fun = bar, np = 90)
heatmap <- as.data.frame(heatmap)

plotly::ggplotly(ggplot(heatmap, aes(phi.h, pressure)) + 
     geom_tile(aes(fill = yhat), color = "white")  + 
     scale_fill_viridis(discrete = FALSE, option = "D") +
     facet_grid(.~position.cat) +
     labs(x = "\nphi.h",
          y = "Pressure",
          fill = "Predicted\nMean\nProduction",
          title = "Predicted Production",
          subtitle = "Predicted mean production as a function of pressure, phi.h and position",
          caption = "Insufficient data to estimate production when pressure is 15.") + 
     scale_x_continuous(breaks = seq(0, 160, 20)) +
     scale_y_continuous(breaks = seq(0, 20, 2)) +
     theme_minimal() +
     theme(plot.title = element_text(face = "bold", hjust = 0, vjust = 0.8, colour = "#3C3C3C", size = 22, family = "Raleway")) +
     theme(plot.subtitle = element_text(hjust = 0, vjust = 0.8, colour = "#3C3C3C", size = 10, family = "Raleway")) +
     theme(plot.caption = element_text(size = 7, family = "Raleway")) +
     theme(axis.title = element_text()) +
     theme(axis.title.y = element_text(size = 13, angle = 90, family = "Raleway")) +
     theme(axis.title.x = element_text(size = 13, angle = 0, family = "Raleway")) +
     theme(axis.text.x = element_text(size = 10, family = "Raleway", color = "black", margin = margin(r=0))) +
     theme(axis.text.y = element_text(size = 10, family = "Raleway", color = "black")) +
     theme(legend.text = element_text(size = 8, family = "Raleway")) +
     theme(legend.title = element_text(size = 9, family = "Raleway", hjust = 0.5)) +
     theme(legend.position = "right") +
     theme(axis.line = element_blank(), 
           axis.ticks.y = element_blank(),
           panel.grid.major = element_blank(),
           panel.grid.minor = element_blank(),
           panel.border = element_blank()) +
     guides(colour = guide_legend(ncol = 1)) +
     theme(legend.title.align=0) +
     theme(panel.background = element_rect(fill = "#FCFCFC", colour = NA), plot.background = element_rect(fill = "#FCFCFC", colour = NA)))



Colophon

This document and the analysis presented were performed in R, a programming language and software environment for statistical computing and graphics.

Last updated on 2018-05-06 11:19:25

Thomas Speidel

May 7-9, Calgary, Canada