Proc. 18th Int. Conf. Near Infrared Spectrosc., pp. 51–55 (2019)
The aim of our study was to test an iterative validation process implemented in the R software, assessing the accuracy of the best selected equations developed with two different regression algorithms: Partial Least Squares (PLS) and Bayesian regression. A data set (Seta) with 3187 records from 6 different types of forage was used. The calibrations were tested for Protein, Neutral Detergent Fiber and Acid Detergent Fiber. For each sample a spectrum was collected using a FOSS NIRSystem (1100–2498 nm). A subset of 20 samples for each type of forage (Setext; 120 samples) was randomly selected for a final validation of the best selected equations. The remaining samples (Setb = Seta – Setext) were used for the iterative calibration process. In each iteration Setb was randomly divided into a testing set (Settst; 10 % of Setb) and a training set (Settrn = Setb – Settst); 300 iterations were run. All computations were done in the R environment. The packages used were “pls” for PLS, “BGLR” for the Bayesian models and “prospectr” for the spectral treatments. In each iteration we used three spectral treatments (raw, first derivative, and standard normal variate with detrending), two approaches for selecting the optimal number of PLS components, and the Bayesian model. Nine types of equations were therefore developed and tested in each iteration [(2 PLS techniques + 1 Bayesian) × 3 spectral treatments]. Across the 300 iterations, for each of the 9 equation types, the best equation (lowest RMSE) and the average of the best 25 % (RMSE below the first quartile) were selected and validated by forage type. R demonstrated its potential for chemometric processing of large data sets with complex statistical procedures. An R² higher than 0.9 was obtained for almost all the calibrations. In the external validation the Bayesian models in many cases outperformed the commonly used PLS, showing that an alternative for improving prediction accuracy exists.
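The resampling scheme described above can be illustrated with a minimal sketch. It is written in Python rather than R and is not the authors' code: the function names, the record layout (a dict with "forage" and "y" keys) and the mean-only placeholder model are all assumptions made for illustration. In the study the fitted models were PLS and Bayesian regressions on treated spectra, and "the average of the best 25 %" may refer to averaged equations rather than the list of retained models returned here.

```python
import random
import statistics
from math import sqrt


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def split_external(records, per_type=20, rng=None):
    """Hold out `per_type` random records of each forage type (Setext);
    everything else forms the calibration pool (Setb)."""
    rng = rng or random.Random(0)
    by_type = {}
    for i, rec in enumerate(records):
        by_type.setdefault(rec["forage"], []).append(i)
    held_out = set()
    for indices in by_type.values():
        held_out.update(rng.sample(indices, per_type))
    set_ext = [records[i] for i in sorted(held_out)]
    set_b = [rec for i, rec in enumerate(records) if i not in held_out]
    return set_ext, set_b


def fit_mean(training):
    """Placeholder model (assumption): always predicts the training mean."""
    mean_y = statistics.fmean(rec["y"] for rec in training)
    return lambda rec: mean_y


def run_iterations(set_b, fit, n_iter=300, test_fraction=0.10, rng=None):
    """Repeat the random test/train split and record (RMSE, model) each time."""
    rng = rng or random.Random(1)
    n_test = round(len(set_b) * test_fraction)
    results = []
    for _ in range(n_iter):
        shuffled = set_b[:]
        rng.shuffle(shuffled)
        testing, training = shuffled[:n_test], shuffled[n_test:]
        model = fit(training)
        error = rmse([rec["y"] for rec in testing], [model(rec) for rec in testing])
        results.append((error, model))
    return results


def select_best(results):
    """Return the lowest-RMSE result and all results below the first quartile."""
    ordered = sorted(results, key=lambda item: item[0])
    first_quartile = statistics.quantiles([item[0] for item in ordered], n=4)[0]
    top_quarter = [item for item in ordered if item[0] < first_quartile]
    return ordered[0], top_quarter
```

In this sketch one call to `run_iterations` corresponds to a single equation type; repeating it for each combination of spectral treatment and regression method would give the nine sets of results from which the best equations are then checked against the held-out set.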
The present work has demonstrated that iterative validation by subsampling on large data sets can lead to the selection of suitable equations, and that it can be carried out in R.