GeoConvention 2018
Data Science Tools for Petroleum Exploration and Production

Preamble

This tutorial is meant to accompany a talk given by Matteo Niccoli and myself (Thomas Speidel) at the 2018 GeoConvention in Calgary, Canada, titled Data Science Tools for Petroleum Exploration and Production. Many of the ideas were inspired by Niccoli’s past work. See, for instance, Machine learning in geoscience with scikit-learn - notebook 1.

This tutorial was created using the open-source language R. However, most of the methods illustrated can also be done in Python. See Niccoli’s implementation of some of these methods in the Python environment, for instance the implementation of distance correlation in Python.



Description of the Example Dataset

We will illustrate a number of methodologies on the data set from Hunt1. Niccoli2 illustrated a number of statistical techniques on this data set in a series of notebooks. As stated there: “the target variable to be predicted, Y, is oil production (measured in tens of barrels of oil per day) from a marine barrier sand”. Below is a sample of the data:

1 Hunt, L. (2013), Many correlation coefficients, null hypotheses, and high value. CSEG Recorder, 38 (10)
2 Niccoli, M (2017), Machine learning in geoscience with scikit-learn - notebook 2



Is There A Difference in Production?

Suppose we are interested in assessing whether production changes according to position. The conventional approach is to perform an ANOVA or a t-test. However, both rely on normality and are sensitive to outliers. When comparing two samples, we can utilize the Wilcoxon test, while for more than two samples the Kruskal-Wallis test, a generalization of the Wilcoxon test, can be used.

First, let’s assess the distribution graphically. From the boxplot below, it is not apparent whether there is a difference in production between positions:


Next, we perform the Wilcoxon test:

estimate  statistic  p.value  conf.low  conf.high  method                  alternative
14.98     83         0.041    4.64      27.3       Wilcoxon rank sum test  two.sided

A P-value of 0.041 tells us that, assuming there is no true difference (the null hypothesis), the probability of observing a difference in distribution equal to or more extreme than the one observed is 4.1%. In other words, it’s somewhat unlikely, but not impossible, that the difference in production is due to chance variation.
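For reference, output like the table above can be produced along these lines (a minimal sketch, assuming the data frame is named hunt and position.cat has two levels; broom::tidy() gives the tabular summary):

library(broom)

# Two-sample Wilcoxon (Mann-Whitney) test of production by position,
# with a confidence interval for the location shift
wt <- wilcox.test(production ~ position.cat, data = hunt, conf.int = TRUE)
tidy(wt)   # estimate, statistic, p.value, conf.low, conf.high, ...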


Distance Correlation

Measures of correlation quantify the strength of the relationship between pairs of variables. The traditional Pearson correlation has several limitations, one of which is that it assumes a linear relationship. The Spearman correlation is generally preferred and avoids some of these limitations. A relatively new and powerful measure of correlation is the distance correlation.

Above is the distance correlation matrix. Colored boxes represent statistically significant correlations at the 10% significance level. The correlation measure is able to detect that gross.pay is highly correlated with gross.pay.transform (they are algebraically related). Also, gross.pay is highly correlated with production. Notice that, unlike other correlation measures, distance correlation does not indicate direction (positive or negative correlation).
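In R, distance correlations such as those in the matrix can be computed pairwise with the energy package (a minimal sketch; the exact functions used to build the matrix and assess significance are assumptions):

library(energy)

# Distance correlation between production and gross.pay
dcor(hunt$production, hunt$gross.pay)

# Permutation test of independence based on distance covariance;
# its p-value can be compared against the 10% significance level
dcov.test(hunt$production, hunt$gross.pay, R = 999)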


A Note on the Choice of Correlation

Among measures of correlation capable of detecting non-linear association, MIC (maximal information coefficient, Reshef et al.1) has been touted as a powerful measure. However, several authors (Kinney2, Simon3) point to a number of problems with MIC and suggest utilizing distance correlation instead:

“We believe that the recently proposed distance correlation measure of Székely & Rizzo (2009) is a more powerful technique that is simple, easy to compute and should be considered for general use”.

1 Reshef et al. (2011), Detecting novel associations in large data sets
2 Kinney et al. (2013), Equitability, mutual information, and the maximal information coefficient
3 Simon et al. (2014), Comment on "Detecting Novel Associations in Large Data Sets"



Smoothers

Correlation matrices are useful to visualize combinations of pairwise correlations. There is, however, more information to be had by plotting the data and adding a trend line. One of the most flexible smoothers is Cleveland’s LOESS (locally weighted scatterplot smoothing). Let’s plot the LOESS curves for each variable against production, separately for each position.

library(ggplot2)    # plotting
library(Hmisc)      # stat_plsmo() nonparametric smoother
library(patchwork)  # combine the panels with `+`

g1 <- ggplot(hunt, aes(y = production, x = gross.pay, colour = factor(position.cat))) +
  stat_plsmo(fullrange = TRUE) + 
  geom_point() +
  xlab("Gross Pay") +
  ylab("Production") +
  theme_minimal(base_family = "Raleway", base_size = 11) +
  scale_color_discrete(guide = guide_legend(title = "Position")) +
  theme(legend.position="none") +
  theme(panel.background = element_rect(fill = "#FCFCFC", colour = NA),
        plot.background = element_rect(fill = "#FCFCFC", colour = NA))

g2 <- ggplot(hunt, aes(y = production, x = phi.h, colour = factor(position.cat))) +
  stat_plsmo(fullrange = TRUE) + 
  geom_point() +
  xlab("Phi-h") +
  ylab("Production") +
  theme_minimal(base_family = "Raleway", base_size = 11) +
  scale_color_discrete(guide = guide_legend(title = "Position")) +
  theme(legend.position="none") +
  theme(panel.background = element_rect(fill = "#FCFCFC", colour = NA),
        plot.background = element_rect(fill = "#FCFCFC", colour = NA))


g3 <- ggplot(hunt, aes(y = production, x = pressure, colour = factor(position.cat))) +
  stat_plsmo(fullrange = TRUE) + 
  geom_point() +
  xlab("Pressure") +
  ylab("Production") +
  theme_minimal(base_family = "Raleway", base_size = 11) +
  scale_color_discrete(guide = guide_legend(title = "Position")) +
  theme(legend.position="none") +
  theme(panel.background = element_rect(fill = "#FCFCFC", colour = NA),
        plot.background = element_rect(fill = "#FCFCFC", colour = NA))

g4 <- ggplot(hunt, aes(y = production, x = random.1, colour = factor(position.cat))) +
  stat_plsmo(fullrange = TRUE) + 
  geom_point() +
  xlab("Random Var") +
  ylab("Production") +
  theme_minimal(base_family = "Raleway", base_size = 11) +
  scale_color_discrete(guide = guide_legend(title = "Position")) +
  theme(legend.position="none") +
  theme(panel.background = element_rect(fill = "#FCFCFC", colour = NA),
        plot.background = element_rect(fill = "#FCFCFC", colour = NA))

g5 <- ggplot(hunt, aes(y = production, x = gross.pay.transform, colour = factor(position.cat))) +
  stat_plsmo(fullrange = TRUE) + 
  geom_point() +
  xlab("Gross Pay (transformed)") +
  ylab("Production") +
  theme_minimal(base_family = "Raleway", base_size = 11) +
  scale_color_discrete(guide = guide_legend(title = "Position")) +
  theme(legend.position="none") +
  theme(panel.background = element_rect(fill = "#FCFCFC", colour = NA),
        plot.background = element_rect(fill = "#FCFCFC", colour = NA))


g1+g2+g3+g4+g5

Scatter plots of production against each variable, by position (colour denotes position). A LOESS curve was added to highlight the overall trend.



Variable Selection

Which variables should we focus on in understanding what drives changes in production? Which variables are measuring the same or similar underlying quantities?

A statistically principled approach is to start by eliminating variables based on domain knowledge. This may be assisted by statistical data reduction methods such as clustering and redundancy analysis.


Clustering

Here we assess clusters of independent variables. We do so blinded to the response variable, production, in order to avoid introducing bias. In cluster analysis, the objective is to identify variables that are measuring the same underlying phenomenon. If such variables exist, and domain knowledge supports it, one could argue for removing one of them.

Hierarchical cluster dendrogram of all variables except production. The similarity matrix is based on the Hoeffding D statistic, which detects non-monotonic associations.
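A dendrogram of this kind can be produced with Hmisc’s varclus(), which supports a Hoeffding D similarity (a minimal sketch; the variable list is assumed from the data set described earlier):

library(Hmisc)

# Cluster the candidate predictors (response excluded) using Hoeffding's D
vc <- varclus(~ gross.pay + phi.h + pressure + random.1 + random.2 +
                gross.pay.transform + position.cat,
              data = hunt, similarity = "hoeffding")
plot(vc)   # hierarchical cluster dendrogram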

There are numerous methods to perform hierarchical clustering. It’s good practice to try different methods to see if the results are in the same ballpark. Let’s do that with Hierarchical Clustering with P-Values via Multiscale Bootstrap1:

## Creating a temporary cluster...done:
## socket cluster with 7 nodes on host 'localhost'
## Multiscale bootstrap... Done.
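The multiscale bootstrap clustering is implemented in the pvclust package, which produced the messages above when run in parallel (a minimal sketch; the distance measure, linkage method and number of bootstrap replicates are assumptions):

library(pvclust)

# pvclust clusters the columns of its input; bootstrap p-values (AU/BP)
# indicate how strongly the data support each cluster
pv <- pvclust(scale(hunt[, c("gross.pay", "phi.h", "pressure", "random.1",
                             "random.2", "gross.pay.transform")]),
              method.dist = "correlation", method.hclust = "ward.D2",
              nboot = 1000, parallel = TRUE)
plot(pv)
pvrect(pv, alpha = 0.95)   # highlight clusters strongly supported by the data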

From either cluster dendrogram we can see how gross.pay and gross.pay.transform cluster together and, to a lesser extent, phi.h. Based on these results, the natural choice is to remove either gross.pay or gross.pay.transform from further analyses. Suppose we don’t know that gross.pay.transform is algebraically related to gross.pay: which of the two should we exclude? Redundancy analysis helps us determine that.

1 Suzuki et al. (2006), Pvclust: an R package for assessing the uncertainty in hierarchical clustering


Redundancy Analysis

In redundancy analysis we look at which variables can be predicted with high confidence from any combination of the other variables. Those variables can then be safely omitted from further analysis. This approach is more robust than the pairwise correlation measures from before. We’ve set the R2 cutoff at 0.7.
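Redundancy analysis is implemented in the Hmisc package as redun() (a minimal sketch; the set of candidate predictors mirrors the variable list below, and r2 sets the R2 cutoff):

library(Hmisc)

# Flag variables that can be predicted from the remaining ones with R^2 >= 0.7
red <- redun(~ gross.pay + phi.h + I(position.cat) + pressure +
               random.1 + random.2 + gross.pay.transform,
             data = hunt, r2 = 0.7)
red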

We started with:

##  [1] "gross.pay'"          "phi.h'"              "I(position.cat)"    
##  [4] "pressure"            "pressure'"           "random.1"           
##  [7] "random.1'"           "random.2"            "random.2'"          
## [10] "gross.pay.transform"

The model suggests that:

##            gross.pay gross.pay.transform'                phi.h 
##            0.9731901            0.8916161            0.7507618

can be removed because they are predicted from all the remaining variables. The numbers represent the R2 with which a variable can be predicted from all other remaining ones. However, we need to be very cautious, since the sample size is much too low to reliably suggest which variables can be omitted.


LASSO

The previous methods achieved variable reduction without consideration of the response variable, production. This is a sound approach; however, we may wish to do variable selection within a regression model. The LASSO (least absolute shrinkage and selection operator) achieves that by shrinking the model coefficients. The primary purpose of shrinkage methods is to improve predictive accuracy. However, when coefficients are shrunk all the way to zero, the LASSO effectively performs variable selection.
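A fit along these lines might be obtained with the glmnet package (a minimal sketch; the use of cross-validation to pick \(\lambda\) matches the description further below, but the exact settings are assumptions):

library(glmnet)

# Model matrix of predictors (glmnet needs a numeric matrix) and response
x <- model.matrix(production ~ gross.pay + phi.h + pressure + random.1 +
                    random.2 + gross.pay.transform + position.cat,
                  data = hunt)[, -1]
y <- hunt$production

# Cross-validated LASSO (alpha = 1); lambda.min minimizes the CV error
set.seed(123)
cvfit <- cv.glmnet(x, y, alpha = 1)
coef(cvfit, s = "lambda.min")                 # shrunk coefficients
plot(cvfit$glmnet.fit, xvar = "lambda", label = TRUE)  # coefficient paths
plot(cvfit)                                   # CV error as a function of lambda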

Fitting a LASSO model to the data yields the following shrunk coefficients:

Variable              Shrunk Coefficient
(Intercept)           -4.278
gross.pay              1.032
phi.h                  0.116
pressure               1.332
random.1               NA
random.2               0.010
gross.pay.transform    NA
position.cat          -6.544

Looking at the table of coefficients, notice how random.1 and gross.pay.transform have no coefficient. That’s because they have been shrunk to zero, thus achieving variable selection. Our model suggests that these two variables are not useful in explaining changes in production. The coefficient for random.2 is also nearly zero, so we could remove it as well.

To gain more insight, it helps to visualize how fast coefficients are shrunk to zero (thus, excluded from the model) as we increase the amount of shrinkage, \(\lambda\).


On the far left, where \(log(\lambda)\) (x-axis) is large, all coefficients (y-axis) are shrunk to zero, meaning all variables have been excluded. Moving to the right, as \(log(\lambda)\) decreases, gross.pay enters the model, followed by phi.h. On the far right, where \(log(\lambda)\) is smallest (i.e. \(\lambda\) approaches 0), little shrinkage occurs and no variable selection takes place.

Throughout the graph, random.1 and random.2 remain consistently close to the zero line, suggesting that, no matter the value of \(\lambda\), little weight is ever given to either. The vertical line at 0.052 is the value of \(log(\lambda)\) chosen by cross-validation that minimizes the loss function (this is what was used to generate the table above).

For this model the loss function is the mean squared error (MSE). The value of \(\lambda\) minimizing the loss function is depicted by the red line and chosen via cross-validation.

It is helpful to visualize how MSE changes as a function of \(\lambda\), but more crucially, to visualize the error bars of each MSE. This graph should give us pause in picking a \(\lambda\) value, since the MSE for the majority of values below 0.052 is nearly equivalent.


Recursive feature elimination

There is no shortage of methods on variable selection coming from the Machine Learning literature. Recursive feature elimination1 is a model-based method. The technical details are beyond the scope of this article.

The majority of model-based variable selection methods perform better with normalized data (subtract the mean and divide by the standard deviation).
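Results like those in the table below might be produced with caret’s rfe() function (a minimal sketch; the linear-model helper functions, 10-fold cross-validation, and the assumption that all predictors are numeric are choices made for illustration):

library(caret)

set.seed(123)
preds <- hunt[, c("gross.pay", "phi.h", "pressure", "random.1",
                  "random.2", "gross.pay.transform", "position.cat")]

# Normalize the predictors, then recursively eliminate features using a
# linear model assessed by 10-fold cross-validation
rfe_fit <- rfe(x = scale(preds), y = hunt$production,
               sizes = 1:7,
               rfeControl = rfeControl(functions = lmFuncs,
                                       method = "cv", number = 10))
rfe_fit              # RMSE, R squared, MAE by number of variables
predictors(rfe_fit)  # variables in the selected subset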

Variables RMSE Rsquared MAE
1 9.313 0.614 7.588
2 8.247 0.732 6.697
3 6.613 0.828 5.464
4 3.770 0.937 3.015
5 4.078 0.925 3.293
6 4.049 0.926 3.184
7 4.012 0.928 3.091

From the table we can see that a model with 4 variables achieves the lowest RMSE and the highest R2. Their standard deviation is also among the lowest. Those 4 variables are:

## [1] "hunt.phi.h"     "hunt.pressure"  "hunt.gross.pay" "position.cat"

We can, of course, plot the results for more immediacy:

1 Guyon et al., Gene Selection for Cancer Classification using Support Vector Machines


Sparse Principal Components

Principal components have a long and rich history in statistics and several applied sciences, having been first proposed by Harold Hotelling in the 1930’s. Principal components allow us to describe our data with fewer new variables, called principal components, each a linear combination of the original variables. These components are derived in such a way as to capture as much of the variation in the data as possible. For a wonderful explanation of principal components see this CrossValidated discussion.

Sparse principal components use ideas similar to those of the LASSO, in the sense that instead of utilizing all variables, only some variables contribute to each component (hence the term sparse). Therefore, as with the LASSO, some loadings are shrunk to zero. Let’s take a look at the loadings, that is, the weights of the linear combinations1:

Comp.1 Comp.2 Comp.3 Comp.4
gross.pay 0.5990 0.0000 0.0000 0.0000
phi.h 0.5450 -0.0814 0.0000 0.0000
pressure 0.0000 -0.3480 0.0000 0.8804
random.1 0.0000 0.2194 0.9449 0.0000
random.2 0.0000 -0.6331 0.3275 0.0000
gross.pay.transform 0.5867 0.0756 0.0000 0.0000
position.cat 0.0000 0.6462 0.0000 0.4742
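Sparse loadings of this kind can be computed, for example, with spca() from the elasticnet package (a minimal sketch; the package choice, the scaling, and the number of non-zero loadings per component are assumptions made for illustration):

library(elasticnet)

x <- scale(hunt[, c("gross.pay", "phi.h", "pressure", "random.1",
                    "random.2", "gross.pay.transform", "position.cat")])

# Sparse PCA: 4 components, constraining the number of non-zero loadings
# per component via sparse = "varnum" (illustrative values)
sp <- spca(x, K = 4, type = "predictor", sparse = "varnum",
           para = c(3, 3, 2, 2))
round(sp$loadings, 4)   # sparse loadings matrix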

Notice how, in the loadings matrix, some weights were shrunk to zero. gross.pay is assigned the greatest weight in PC1, while position.cat is assigned the greatest loading in PC2. Next, we are going to visualize the PCs:


The plot above is called a biplot and it is filled with information:

  • Principal component 1 accounts for 45% of the variation
  • Principal component 2 accounts for 23% of the variation
  • Together, the two principal components account for 68% of the variation
  • Points that are close to one another have similar values on the variables: for instance, rows 3, 5 and 7 are similar in their gross.pay, gross.pay.transform and phi.h profile, but very different compared to row 15.
  • The correlation between the three vectors gross.pay.transform, gross.pay and phi.h is high, as represented by the small angle between them.
  • The correlation between these three vectors is positive because they all point in similar directions.
  • Take point 18 (far right): we would expect this observation to have among the highest values for gross.pay.transform, gross.pay and phi.h.
  • Production is colour coded, and there seems to be a cluster of low production associated with low gross.pay.transform, gross.pay and phi.h.

One of the limitations of principal components analysis is that the new PCs are not directly interpretable.


1 We will focus on the first two principal components.


Summary of Variable Selection

There is little consensus among researchers on what constitutes a sensible variable selection strategy. In fact, many authors are critical of several variable selection techniques and suggest utilizing domain knowledge coupled with variable selection blinded to the response1.

Our example is hardly demonstrative, since we are dealing with a very limited sample size. Nonetheless, we can summarize the results as follows:

  • Domain knowledge: 4 selected; removed: random.1, random.2, gross.pay.transform. Notes: baseline.
  • Clustering: NA; gross.pay and gross.pay.transform can be represented by a single variable. Notes: only suggestive, not a true variable selection technique.
  • Redundancy analysis: 5 selected; removed: gross.pay, gross.pay.transform.
  • LASSO: 5 selected; removed: random.1, gross.pay.transform. Notes: random.2 was very close to zero.
  • Recursive feature elimination: 4 selected; removed: random.1, random.2, gross.pay.transform.
  • Exhaustive search: 4 selected; removed: random.1, random.2, gross.pay.transform. Notes: based on visual inspection.
  • Sparse principal components: 2 selected; removed: random.1, position.cat. Notes: not a true variable selection technique.

From these limited data, recursive feature elimination performed best. There are many variable selection strategies and we have only skimmed through a few of them.

Little consensus exists around variable selection. Harrell 1,2 states that

Variable selection (without penalization) only makes things worse. Variable selection has almost no chance of finding the “right” variables, and results in large overstatements of effects of remaining variables and huge understatement of standard errors.

Kuhn3 confirms by writing:

This underscores the idea that feature selection is a poor method for determining the most significant variables in the data, as opposed to which predictors mostly influenced the model. Different models will score the predictors differently. To assess which predictors have individual associations with the outcome, common classical statistical methods are far more appropriate.

Hence, using domain knowledge before applying any statistical methods is preferable. If statistical or ML approaches are to be used, it is best to start with methods blinded to the outcome. Shrinkage methods, such as the LASSO, can be used as a next step.

1 See Harrell (2015), Regression Modeling Strategies
2 See Harrell (2015), CrossValidated
3 See Kuhn et al. (2013), Applied Predictive Modeling



Sample Size and Power Calculations

Sample size calculations are customary tools used to estimate how many observations one needs in order to detect a given effect with a certain probability. Effect here means, quite broadly, any difference between, say, two means (or other parameters). For instance, one may be interested in the difference between mean production in the presence or absence of certain geological features, and wonder how many wells would be needed in order to conclude with a given confidence that the two means differ.

Key points and definitions in the evaluation of sample sizes are:

  1. Large differences are easier to detect than smaller differences
  2. We may commit two types of errors, called type I (alpha, or false positive) and type II error (beta, or false negative)
  3. Power refers to 1-beta, or the probability of detecting a difference if in fact it exists
  4. Normally, we solve for n (sample size) or power

Example
Suppose we are trying to assess whether a hypothetical geological feature is associated with increased production. How many wells are needed to achieve an 80% probability of detecting a difference in mean production (1-beta, or power), if we are willing to accept a 10% probability of declaring a false positive (alpha, or significance level)?

In the simplest scenario one could utilize the following formula:

\[\normalsize n = \frac{(Z_{\alpha /2}+Z_{\beta})^{2}\times 2\sigma ^2}{d^{2}}\]

where \(Z_{\alpha /2}\) is the critical value from the normal distribution at α/2 (for a two-sided test); \(Z_{\beta}\) is the critical value from the normal distribution at β; \(\sigma ^2\) is the population variance (or pooled variance); \(d\) is the difference between the two means.

From the equation we need an estimate for \(d = \overline{x_{1}}-\overline{x_{2}}\) and the population standard deviation, \(\sigma\). Normally, we take these estimates from similar analyses or data sets. Suppose from a previous analysis we know that:

Mean production when geological feature is present: 48 (SD: 15.6)
Mean production when geological feature is not present: 27 (SD: 11.6)
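Plugging these estimates into the formula above gives a rough per-group sample size (a minimal sketch in R; it uses a two-sided alpha of 0.10, 80% power, and pools the two standard deviations):

alpha <- 0.10   # significance level (two-sided)
power <- 0.80   # 1 - beta
d     <- 48 - 27                        # difference in means
sigma <- sqrt((15.6^2 + 11.6^2) / 2)    # pooled standard deviation

# n per group from the normal-approximation formula above
n <- (qnorm(1 - alpha / 2) + qnorm(power))^2 * 2 * sigma^2 / d^2
ceiling(n)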

It is always wise to solve for a range of plausible values, rather than a single point. In the graph below, we have included a range of values: 34 to 44 for \(\mu_1\) and 24 to 34 for \(\mu_2\) 1.

Solving the above equation yields the following curves2,3.

## This uses output from NCSS PASS
ss1 <- structure(list(Power = c(0.807, 0.805, 0.803, 0.802, 0.8, 0.8, 
0.801, 0.807, 0.805, 0.803, 0.802, 0.8, 0.816, 0.801, 0.807, 
0.805, 0.803, 0.802, 0.819, 0.816, 0.801, 0.807, 0.805, 0.803, 
0.824, 0.819, 0.816, 0.801, 0.807, 0.805, 0.849, 0.824, 0.819, 
0.816, 0.801, 0.807), N1 = c(26L, 40L, 70L, 156L, 619L, NA, 18L, 
26L, 40L, 70L, 156L, 619L, 14L, 18L, 26L, 40L, 70L, 156L, 11L, 
14L, 18L, 26L, 40L, 70L, 9L, 11L, 14L, 18L, 26L, 40L, 8L, 9L, 
11L, 14L, 18L, 26L), N2 = c(26, 40, 70, 156, 619, 1.7e+308, 18, 
26, 40, 70, 156, 619, 14, 18, 26, 40, 70, 156, 11, 14, 18, 26, 
40, 70, 9, 11, 14, 18, 26, 40, 8, 9, 11, 14, 18, 26), µ1 = c(34, 
34, 34, 34, 34, 34, 36, 36, 36, 36, 36, 36, 38, 38, 38, 38, 38, 
38, 40, 40, 40, 40, 40, 40, 42, 42, 42, 42, 42, 42, 44, 44, 44, 
44, 44, 44), µ2 = c(24, 26, 28, 30, 32, 34, 24, 26, 28, 30, 32, 
34, 24, 26, 28, 30, 32, 34, 24, 26, 28, 30, 32, 34, 24, 26, 28, 
30, 32, 34, 24, 26, 28, 30, 32, 34), µ1...µ2 = c(10, 8, 6, 4, 
2, 0, 12, 10, 8, 6, 4, 2, 14, 12, 10, 8, 6, 4, 16, 14, 12, 10, 
8, 6, 18, 16, 14, 12, 10, 8, 20, 18, 16, 14, 12, 10), s1 = c(16, 
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 
16, 16, 16), s2 = c(12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 
12, 12, 12, 12, 12, 12, 12, 12, 12), Alpha = c(0.1, 0.1, 0.1, 
0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 
0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 
0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1), Beta = c(0.193, 0.195, 0.197, 
0.198, 0.2, 0.2, 0.199, 0.193, 0.195, 0.197, 0.198, 0.2, 0.184, 
0.199, 0.193, 0.195, 0.197, 0.198, 0.181, 0.184, 0.199, 0.193, 
0.195, 0.197, 0.176, 0.181, 0.184, 0.199, 0.193, 0.195, 0.151, 
0.176, 0.181, 0.184, 0.199, 0.193)), .Names = c("Power", "N1", 
"N2", "µ1", "µ2", "µ1...µ2", "s1", "s2", "Alpha", "Beta"), class = "data.frame", row.names = c(NA, 
-36L))

## Or could have used R
# pwr::pwr.t.test(d=(0-18)/14.14214,
#                 power = 0.8,
#                 sig.level = 0.10,
#                 type = "two.sample",
#                 alternative = "two.sided")


library(dplyr)  # mutate() and the pipe operator

ss1 <- ss1 %>%
  mutate(µ2 = as.factor(µ2))

plotly::ggplotly(ggplot(ss1, aes(x = µ1, y = N1, color = µ2)) +
  geom_point(size = 3) + 
  geom_line(size = 1) +
  scale_colour_viridis_d() +
  scale_y_log10(breaks = c(8, 9, 10, 12, 14, 16, 20, 24, 28, 34, 44, 50, 60, 70, 80, 100, 150, 250, 400, 600)) +
annotation_logticks(sides = "lr") +
  labs(y = "N1",
       color = "µ2",
       caption = "N2 assumed to be equal to N1") +
  theme_minimal(base_family = "Raleway", base_size = 14) +
  theme(panel.background = element_rect(fill = "#FCFCFC", colour = NA),
        plot.background = element_rect(fill = "#FCFCFC", colour = NA))
)


Therefore, if we believe that the true mean production with and without the geological feature is, say, 42 and 26 (\(d=16\)), respectively, we would need at least 11 wells in each group (22 in total) in order to have an 80% probability of detecting a difference in production as small as 16 (see notes 4 and 6). We are willing to accept a 10% chance of concluding there is a significant difference in production when, in fact, there is none.

The statistical field on sample size and power calculations is very rich. Here, we have taken numerous liberties in order to illustrate the general approach. Readers interested in this area are advised to work with statisticians5 and to utilize sample size and power calculation tools or libraries.


1 Readers may wonder why the range of values does not match the means provided. We constructed the range by examining bootstrap replicates of mean production in order to have a better sense of how much the mean could vary.
2 The choice of solving for n or power depends on the objective. For power, it is tempting to perform the calculations after collecting the data and utilize the observed parameters (post-hoc). This is considered poor statistical practice.
3 We utilized a different formula than the one shown in order to account for different variances.
4 In this test we chose what’s called a two-sided test: we are entertaining the possibility that one mean could be higher or lower than the other mean. A one-sided test is possible, but should only be performed if we have strong evidence that one mean is higher than the other.
5 The number of statistical procedures for sample size and power calculations can be overwhelming. For instance, PASS, a software tool for sample size and power calculations, lists as many as 740 statistical tests.
6 We have chosen to have equal sample size allocation. However, one could choose different sample sizes for n1 and n2. This is useful when, for instance, the cost of sampling is higher in one group than the other.



Regression

In this section we will fit a regression model to better understand changes in production. We are going to utilize the previous findings of the variable selection by entertaining a model with gross.pay, phi.h, pressure and position.cat. Notice that the sample size is much too small to fit a model with all these variables. A general rule of thumb suggests we need 15-20 observations per variable; therefore, we would need about 80 observations (wells in this case), far more than the 21 we have.

Recall from the correlation matrix that gross.pay and phi.h are highly correlated. This is going to be a problem when we interpret the model because of collinearity. Collinearity makes it difficult to estimate and interpret a particular regression coefficient because the data have little information about the effect of changing one variable while holding another (highly correlated) variable constant (Harrell, 2015).

So, which variable should we keep in the model, gross.pay or phi.h?
A sensible strategy is to let domain knowledge about the physics guide which one should be in the model. One also needs to consider costs: if collecting one variable will be significantly more expensive in the future, we might want to stop considering it further. However, suppose we have no preference. Let’s revisit redundancy analysis, but this time focus only on a subset of variables:

## 
## Redundancy Analysis
## 
## redun(formula = ~gross.pay + phi.h + I(position.cat) + pressure, 
##     data = hunt, r2 = 0.7)
## 
## n: 21    p: 4    nk: 3 
## 
## Number of NAs:    0 
## 
## Transformation of target variables forced to be linear
## 
## R-squared cutoff: 0.7    Type: ordinary 
## 
## R^2 with which each variable can be predicted from all other variables:
## 
##       gross.pay           phi.h I(position.cat)        pressure 
##           0.838           0.802           0.301           0.037 
## 
## Rendundant variables:
## 
## gross.pay
## 
## Predicted from variables:
## 
## phi.h I(position.cat) pressure 
## 
##   Variable Deleted   R^2 R^2 after later deletions
## 1        gross.pay 0.838

What does this tell us? That it’s marginally easier to predict gross.pay from all other variables (R2 = 0.838) than it is to predict phi.h. Therefore, we are better off removing gross.pay and keeping phi.h instead.


Non Linearities

It is generally wise not to assume a system, especially in nature, behaves linearly. The major limitation here, however, is sample size: entertaining non-linearities costs additional degrees of freedom. Thus, it is important to be very selective. A general technique is to entertain non-linearities only for those variables that have enough predictive power.

Once again, recall the correlation matrix. Which variable among phi.h, position.cat and pressure has the highest correlation with the response, production? That would be phi.h (distance correlation = 0.88). Therefore, we will assign more degrees of freedom to phi.h, on the grounds that missing a non-linear relationship here would be costlier.

How exactly are we going to make a variable non-linear? Several approaches exist. One flexible approach is based on restricted cubic splines. The details are beyond the scope of this tutorial. Interested readers can refer to Davis1, who articulates the intuition and provides a fascinating historical perspective.

1 Davis, P. (1996), B-Splines and Geometric Design.


Linear Model

Armed with this information, it’s time to fit a linear model. But before we do that, we need to decide on the kind of model. The size of the data is much too small to consider models from the Machine Learning field1.

We could try least squares regression, but our response, production, is a rate and we could run into problems. Beta, Poisson, or negative binomial regression may provide better approaches. Here we will illustrate a robust approach, ordinal regression, a type of semi-parametric model.

1 Van der Ploeg et al. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints

Probit Ordinal Regression Model
 orm(formula = production ~ rcs(phi.h, 3) + position.cat + pressure, 
     data = hunt, x = TRUE, y = TRUE, family = "probit")
 
                        Model Likelihood     Discrimination              Rank Discrim.
                        Ratio Test           Indexes                     Indexes
Obs                21   LR χ2       54.82    R2                  0.929   ρ   0.969
Unique Y           21   d.f.            4    g                   4.831
Y0.5            36.42   Pr(>χ2)   <0.0001    gr                125.315
max |∂log L/∂β| 0.005   Score χ2    34.25    |Pr(Y ≥ median)-½|   0.430
                        Pr(>χ2)   <0.0001

                     β        S.E.     Wald Z   Pr(>|Z|)
phi.h                0.1185   0.0253    4.68    <0.0001
phi.h’              -0.0515   0.0171   -3.02     0.0025
position.cat=[2,3]  -2.7903   0.7852   -3.55     0.0004
pressure             0.4958   0.1246    3.98    <0.0001
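A fit like the one summarized above can be obtained with the rms package (a minimal sketch; the datadist setup and the use of Mean() to create the bar function referenced in the heatmap code below are assumptions):

library(rms)

dd <- datadist(hunt); options(datadist = "dd")

# Probit ordinal regression with a 3-knot restricted cubic spline for phi.h
f1 <- orm(production ~ rcs(phi.h, 3) + position.cat + pressure,
          data = hunt, x = TRUE, y = TRUE, family = "probit")
f1

# Function mapping the linear predictor to a predicted mean;
# assumed to be the `fun = bar` argument used in Predict() below
bar <- Mean(f1)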

## Heatmap of predicted mean production
library(viridis)  # for scale_fill_viridis()

heatmap <- Predict(f1, phi.h, pressure, position.cat, fun = bar, np = 90)
heatmap <- as.data.frame(heatmap)

plotly::ggplotly(ggplot(heatmap, aes(phi.h, pressure)) + 
     geom_tile(aes(fill = yhat), color = "white")  + 
     scale_fill_viridis(discrete = FALSE, option = "D") +
     facet_grid(.~position.cat) +
     labs(x = "\nphi.h",
          y = "Pressure",
          fill = "Predicted\nMean\nProduction",
          title = "Predicted Production",
          subtitle = "Predicted mean production as a function of pressure, phi.h and position",
          caption = "Insufficient data to estimate production when pressure is 15.") + 
     scale_x_continuous(breaks = seq(0, 160, 20)) +
     scale_y_continuous(breaks = seq(0, 20, 2)) +
     theme_minimal() +
     theme(plot.title = element_text(face = "bold", hjust = 0, vjust = 0.8, colour = "#3C3C3C", size = 22, family = "Raleway")) +
     theme(plot.subtitle = element_text(hjust = 0, vjust = 0.8, colour = "#3C3C3C", size = 10, family = "Raleway")) +
     theme(plot.caption = element_text(size = 7, family = "Raleway")) +
     theme(axis.title = element_text()) +
     theme(axis.title.y = element_text(size = 13, angle = 90, family = "Raleway")) +
     theme(axis.title.x = element_text(size = 13, angle = 0, family = "Raleway")) +
     theme(axis.text.x = element_text(size = 10, family = "Raleway", color = "black", margin = margin(r=0))) +
     theme(axis.text.y = element_text(size = 10, family = "Raleway", color = "black")) +
     theme(legend.text = element_text(size = 8, family = "Raleway")) +
     theme(legend.title = element_text(size = 9, family = "Raleway", hjust = 0.5)) +
     theme(legend.position = "right") +
     theme(axis.line = element_blank(), 
           axis.ticks.y = element_blank(),
           panel.grid.major = element_blank(),
           panel.grid.minor = element_blank(),
           panel.border = element_blank()) +
     guides(colour = guide_legend(ncol = 1)) +
     theme(legend.title.align=0) +
     theme(panel.background = element_rect(fill = "#FCFCFC", colour = NA), plot.background = element_rect(fill = "#FCFCFC", colour = NA)))



Colophon

This document and the analysis presented were performed in R, a programming language and software environment for statistical computing and graphics.

Last updated on 2018-05-06 11:19:25

Thomas Speidel

May 7-9, Calgary, Canada