Wednesday, 9 May 2012

Use R! - Part 2



Here is a follow-up to my first post about using R. For our yearly KU Leuven Geology PhD Seminar (08-09/05/2012), I quickly pasted this script together from several examples I had run into in the past, as well as some things that I have been doing myself in R. Be aware that these analyses of the Meuse dataset are not always very useful! The goal was simply to demonstrate several numerical/statistical techniques within R, in an attempt to add a few more R converts to the community. Below the example graphs you will find some explanation of what they represent, and the R code that I sent to all members of the statistics and numerical analysis discussion group.


The Meuse data set consists of 155 samples of top soil heavy metal concentrations (ppm), along with a number of soil and landscape variables, collected in a flood plain of the river Meuse, near the village of Stein. Below is a map of the relative cadmium concentrations, and the scatterplot matrix for the different metal concentrations.


Linear mixed effects models are suited for linear dependence modelling when your dataset has one or more categorical variables, and you expect both a general relationship and slight differences between the classes. With the Meuse dataset, one could for example predict the copper concentration as a function of the lead concentration. The bold black line is the result of ordinary linear regression. By allowing an additional random effect for each flood frequency class, one obtains the other three lines, one for each class (this was done with lme4). This is probably not very useful for this dataset, but I have used this extension of linear regression several times, relating different material properties with for example the stratigraphy as categorical variable. It definitely is a useful technique within the geosciences.
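As a self-contained illustration of the idea (simulated data, not the Meuse set, and plain per-class lm fits rather than lme4, so without the shrinkage towards the overall line that a mixed model would give):

```r
# Simulated example: one overall trend plus small class-specific slope
# deviations, the situation mixed effects models are designed for.
set.seed(1)
cls <- rep(1:3, each = 30)                          # three hypothetical classes
x <- runif(90, 0, 10)
y <- 5 + 2 * x + c(-0.4, 0, 0.4)[cls] * x + rnorm(90) # per-class slopes around 2
pooled <- lm(y ~ x)                                 # one overall regression line
coef(pooled)                                        # overall slope close to 2
# Separate per-class fits scatter around the pooled slope; a mixed model
# (lme4::lmer) would estimate these jointly, shrinking them towards the
# pooled line instead of fitting each class completely independently.
sapply(1:3, function(k) coef(lm(y ~ x, subset = cls == k))[2])
```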


Next is the non-linear modelling of the metal concentrations' dependence on elevation, organic matter content and the distance from the river Meuse. While the concentrations show mutually linear dependence, this is not true for these three predictors, so we can for example use an artificial neural network (data transformations might of course also work, but I wanted to show how to use ANNs in R with AMORE).


The two figures below show the model fit (predicted vs target values) after training a single network, and the leave-one-out cross-validation results to estimate the true predictive error. Especially from the lead and zinc plots, it becomes apparent that the training error is smaller than the true predictive error. Tuning the ANN parameters would probably yield better results, but that was outside the scope of this demonstration.
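The gap between training error and predictive error can be reproduced with a self-contained toy example (simulated data and a deliberately over-flexible polynomial instead of an ANN, to keep it in base R):

```r
# Simulated example: training error vs leave-one-out cross-validation error
set.seed(42)
d <- data.frame(x = seq(0, 1, length.out = 30))
d$y <- sin(2 * pi * d$x) + rnorm(30, sd = 0.3)
fit <- lm(y ~ poly(x, 10), data = d)           # over-flexible degree-10 polynomial
trainMSE <- mean(residuals(fit)^2)             # error on the training data itself
looPred <- sapply(1:nrow(d), function(i) {     # refit without the ith sample,
  predict(lm(y ~ poly(x, 10), data = d[-i, ]), # then predict that sample
          newdata = d[i, , drop = FALSE])
})
looMSE <- mean((d$y - looPred)^2)              # estimate of the true predictive error
c(trainMSE, looMSE)                            # the LOO error is clearly larger
```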


Next is data clustering, an unsupervised machine learning technique. In the Meuse dataset, it would be nice to discover different classes based on the metal concentrations alone. The lead vs copper scatterplot nicely shows different groups depending on the distance from the Meuse. Suppose we did not know these distances: would we be able to detect these classes just from the concentration data?


Two algorithms are compared here: k-means and model-based clustering. For the first, the number of classes has to be specified in advance; the plots are created for up to ten classes. Model-based clustering (done with MCLUST) uses expectation-maximization to estimate the optimal number of classes. This example clearly demonstrates the main disadvantage of k-means clustering: it favours compact, roughly spherical clusters of similar spread. Model-based clustering overcomes these problems, and provides a decent classification according to the distance to the river Meuse, as shown in the plot above.
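The shape bias of k-means is easy to reproduce in base R with simulated data (two elongated, parallel point clouds; everything here is illustrative and unrelated to the Meuse data):

```r
# Simulated example: k-means applied to two elongated, parallel clusters
set.seed(7)
bars <- rbind(cbind(runif(100, 0, 20), rnorm(100, 0, 0.2)),  # lower bar
              cbind(runif(100, 0, 20), rnorm(100, 2, 0.2)))  # upper bar
km <- kmeans(bars, centers = 2, nstart = 50)
table(km$cluster, rep(c('lower', 'upper'), each = 100)) # each class mixes both bars
plot(bars, col = km$cluster, pch = 16, xlab = 'x', ylab = 'y',
     main = 'K-means cuts across elongated clusters')   # the split is vertical
```

Because k-means minimizes within-cluster sums of squares, cutting each bar in half costs less than respecting the elongated shapes. Mclust, by contrast, fits full-covariance Gaussian components and can typically recover elongated groups like these.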


Finally, some geostatistics of course! Below are examples of ordinary kriging of the log-transformed zinc concentrations, as provided in the gstat demo.
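For reference, the spherical variogram model used in the script (partial sill 0.581, range 900 m and nugget 0.0554, the values passed to vgm() below) can be written out in a few lines of base R; sphVariogram is just an illustrative name:

```r
# Spherical variogram model:
# gamma(0) = 0
# gamma(h) = nugget + psill * (1.5*h/a - 0.5*(h/a)^3)  for 0 < h < a
# gamma(h) = nugget + psill                            for h >= a
sphVariogram <- function(h, psill = 0.581, a = 900, nugget = 0.0554) {
  ifelse(h == 0, 0,
         nugget + psill * ifelse(h < a, 1.5 * h / a - 0.5 * (h / a)^3, 1))
}
curve(sphVariogram(x), from = 0, to = 1500, xlab = 'Lag distance h (m)',
      ylab = 'Semivariance', main = 'Spherical variogram model')
# the curve flattens at the sill (nugget + partial sill) beyond the 900 m range
```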



And here is the script, if you want to try it yourself:

####################################################################
# PhD Seminar KU Leuven Geology Section: R examples, by B. Rogiers #
####################################################################
# Download and install R from: http://cran.freestatistics.org/
# Open this script in RGui, or a more advanced R editor like RStudio: http://rstudio.org/
# Execute the code lines one by one
 
install.packages(c('gstat','lme4','AMORE','mclust')) # install packages; only necessary the first time!
 
library(gstat) # load geostatistics package, that contains our dataset
data(meuse) # load the example dataset
?meuse # Help file for the Meuse river data set
plot(meuse$x, meuse$y, cex=(meuse$cadmium)^(1/3), pch=16, xlab='X coordinate',ylab='Y coordinate',main='Cadmium concentration (ppm)') # plot sample locations and cadmium concentration
plot(meuse[,3:6]) # concentration scatterplot matrix
n <- nrow(meuse) # number of samples
 
# Dependence modelling
 
  # Linear mixed models
 
    library(lme4) # load necessary package
    classes <- as.numeric(meuse$ffreq) # choose data classes: flood frequency 1,2,3
    plot(meuse$lead, meuse$copper, col=classes, pch=classes, xlab='Lead (ppm)',ylab='Copper (ppm)',main='Flood frequency classes') # plot lead and copper concentrations for each of the classes
    abline(lm(meuse$copper ~ meuse$lead)$coef, lwd=2) # draw a fitted linear regression line
    ?lmer # Help file for the lmer function
    lmer(meuse$copper ~ meuse$lead + (meuse$lead|classes)) # fit a linear mixed effects model, with a random effect for each class
    coefs <- coef(lmer(meuse$copper ~ meuse$lead + (meuse$lead|classes))) # save the coefficients of this model
    for(i in 1:3) abline(a=coefs$classes[i,1], b=coefs$classes[i,2],col=i, lty=i) # plot the 3 resulting lines
 
  # Neural networks
 
    library(AMORE) # load necessary package
    plot(meuse[,c(3:7,9,14)]) # scatterplot matrix of some numeric parameters
    input <- meuse[-c(42,43),c(7,9,14)] # use elevation, distance to the Meuse, and organic matter content (input data), to model
    target <- meuse[-c(42,43),c(3:6)] # the cadmium, copper, lead and zinc concentrations (target values); rows 42 and 43 contain NA values
    input <- scale(input, center=T, scale=T) # standardize input data
    target <- scale(target, center=T, scale=T) # standardize target data
    ?newff # Help file for the newff function
    ann <- newff(c(3,20,4), learning.rate.global=0.03, momentum.global=0.1, error.criterium="LMS", Stao=NA, hidden.layer="sigmoid", output.layer="purelin", method="ADAPTgdwm") # create a neural network
    ?train # Help file for the train function
    trainedANN <- train(ann, input, target, show.step=1, n.shows=1000) # train the neural network
    fit <- sim(trainedANN$net, input) # get the predictions for all data
    par(mfrow=c(2,2)) # plot 2 x 2 figures
    plot(fit[,1], target[,1], xlab='Model prediction', ylab='Target variable', main='Cadmium') # scatterplot of observations vs predictions
    plot(fit[,2], target[,2], xlab='Model prediction', ylab='Target variable', main='Copper') # scatterplot of observations vs predictions
    plot(fit[,3], target[,3], xlab='Model prediction', ylab='Target variable', main='Lead') # scatterplot of observations vs predictions
    plot(fit[,4], target[,4], xlab='Model prediction', ylab='Target variable', main='Zinc') # scatterplot of observations vs predictions
    par(mfrow=c(1,1)) # reset plot options
 
# Uncertainty quantification
# http://episun7.med.utah.edu/~alun/teach/stats/week08.pdf
 
  hist(meuse$cadmium) # histogram of the cadmium concentrations
  mean(meuse$cadmium) # mean of cadmium concentrations
  var(meuse$cadmium)/n # variance of the sample mean
  dataSample <- meuse$cadmium # use all data
  #n <- 15;dataSample <- sample(meuse$cadmium, size=n) # use smaller data sample
 
  # Bootstrapping
  # estimate variance of a statistic
  # good for smaller samples or awkward distributions
  # generate set of random samples with replacement to evaluate the statistic
 
    # non-parametric bootstrap:
 
      s <- 1000 # number of resampling cases
      t <- rep(0,s) # initialize the resulting vector of means
      for (i in 1:s) t[i] = mean(sample(dataSample,replace=TRUE)) # loop through all resampling cases, and calculate the mean
      mean(t) # bootstrapped estimate of mean
      var(t) # bootstrapped estimate of variance of the mean
 
    # semi-parametric bootstrap:
 
      a <- 0.1 # smoothing parameter
      for (i in 1:s) t[i] = mean( sample(dataSample,replace=TRUE) + a * rnorm(n)) # loop through all resampling cases, add random normal error to each sample, and calculate the mean
      mean(t) # bootstrapped estimate of mean
      var(t) # bootstrapped estimate of variance of the mean
 
  # Jack-knifing
  # estimate variance and bias of a statistic
  # more orderly version of the bootstrap
  # generate n samples by leaving one observation out at a time
 
    t = rep(0,n) # initialize the resulting vector of means
    for (i in 1:n) t[i] = mean(dataSample[-i]) # loop through all cases
    mean(t) # jack-knifed estimate of the mean
    (n-1)/n * sum( (t-mean(t))^2 ) # jack-knifed estimate of the variance of the mean
    n * mean(dataSample) - (n-1) * mean(t) # jack-knifed bias-corrected estimate
 
  # Cross-validation
  # evaluating predictive error for a model fitted to all data
 
    cv <- function(input, target, i) # function to predict the ith sample through cross-validation, with the same ANN architecture as above
    {
      inputCV <- input[-i,]; targetCV <- target[-i,] # remove the ith sample from the training data
      ann <- newff(c(3,20,4), learning.rate.global=0.03, momentum.global=0.1, error.criterium="LMS", Stao=NA, hidden.layer="sigmoid", output.layer="purelin", method="ADAPTgdwm")
      trainedANNCV <- train(ann, inputCV, targetCV, show.step=500, n.shows=1)
      return(sim(trainedANNCV$net, input)[i,]) # predict the left-out sample
    }
    results <- matrix(nrow=nrow(input), ncol=4) # initialize the results matrix
    for(i in 1:nrow(input)) results[i,] <- cv(input, target, i) # loop through all samples, applying the above function
    par(mfrow=c(2,2)) # plot 2 x 2 figures
    for(i in 1:4) {plot(results[,i], target[,i], xlab='Model prediction',ylab='Target variable',main=c('Cadmium','Copper','Lead','Zinc')[i]); print(mean((target[,i]-results[,i])^2))} # plot observations vs predictions for each element
    par(mfrow=c(1,1)) # reset plot options
 
# Clustering
# Unsupervised machine learning technique to find classes in a dataset  
 
  plot(meuse$lead, meuse$copper, cex=(meuse$dist.m)^(1/3)/7, pch=16, xlab='Lead (ppm)',ylab='Copper (ppm)',main='Distance to Meuse river') # let us classify the lead and copper concentrations; two groups are clearly present: close to and far away from the Meuse
 
  # K-means clustering
  # http://en.wikipedia.org/wiki/K-means_clustering
 
    ?kmeans # Help file for the kmeans function
    par(ask=T) # adapt graphical parameters; now press enter in the console window to display plots
    for(i in 1:10) # loop from 1 to 10 k-means classes
    {
      classes <- kmeans(x=data.frame(lead=meuse$lead, copper=meuse$copper), centers=i, iter.max=100, nstart=100)$cluster # classify
      plot(meuse$lead, meuse$copper, col=classes, pch=classes,xlab='Lead (ppm)',ylab='Copper (ppm)', main=paste('K-means clustering, with',i,'classes')) # plot data with classes
    }
    par(ask=F) # reset graphical parameters
 
  # Model-based clustering
  # http://www.stat.washington.edu/raftery/Research/mbc.html
 
    library(mclust) # load necessary package
    ?Mclust # Help file for the Mclust function
    classes <- Mclust(data.frame(lead=meuse$lead, copper=meuse$copper))$classification # classify
    plot(meuse$lead, meuse$copper, col=classes, pch=classes,xlab='Lead (ppm)',ylab='Copper (ppm)',main='Model-based clustering') # plot data with classes
 
# Geostatistics
 
  ?gstat # Help file for the gstat function
  coordinates(meuse) = ~ x+y # Set x and y variables as coordinates
  data(meuse.grid) # Load data grid for estimating Zinc concentrations
  gridded(meuse.grid) = ~ x+y # Set data as gridded with x and y as coordinates
  v <- vgm(0.581, "Sph", 900, nug = 0.0554) # Variogram model
  g <- gstat(id = "ln.zinc", form = log(zinc) ~ 1,data = meuse, nmax = 40, nmin = 20, maxdist = 1000, model = v) # Initialize gstat object
  plot(variogram(g), model=v) # Plot variogram model fit
  x <- predict(g, meuse.grid) # Predict using ordinary kriging
  spplot(x["ln.zinc.pred"], main = "log(zinc) ordinary kriging predictions", col.regions=terrain.colors(30)) # Plot ordinary kriging predictions
  x$ln.zinc.se = sqrt(x$ln.zinc.var) # convert kriging variance to kriging standard error for plotting
  spplot(x["ln.zinc.se"], main = "log(zinc) ordinary kriging prediction standard errors", col.regions=terrain.colors(30)) # Plot ordinary kriging standard error
 
  # need more examples?
 
    demo(cokriging)
    demo(examples)
 
# Other useful info:
 
  # The R project website: http://www.r-project.org/
  # List of packages: http://cran.freestatistics.org/web/packages/available_packages_by_name.html
  # Visualization package for the R statistical analysis platform, loosely based on "The Grammar of Graphics" from Leland Wilkinson: http://had.co.nz/ggplot2/
  # R graph gallery: http://addictedtor.free.fr/graphiques/
  # R & LaTeX: 
    # knitr: http://yihui.name/knitr/
    # Sweave: http://www.statistik.lmu.de/~leisch/Sweave/