Predicting soil texture from laboratory analysis results of selected parameters and comparison of different models #3

A data science project for justifying an empirical observation in the lab

Daniel J. TOTH
Feb 3, 2021 · 8 min read
[Figure: final visual comparison of the resulting eight models]

Introduction and reminder

This is the third post in a series of three on analyzing data originating from a soil testing laboratory. A quick overview of the goals set in the first post:

it has been empirically hinted by professionals that the liquid limit of a soil sample (from which the physical classification of soil samples is derived) is primarily determined by the humus materials and the soluble salt content (represented by specific conductivity), and may be affected by the other parameters. Analyzing these features would make it possible to infer the physical classification of soil samples by predicting the liquid limit.

The following workflow has been planned:

  1. Preparation of dataset for analysis, description of dataset
  2. Apply first a multiple linear regression model, then a 2nd degree polynomial model. Describe and compare the models with some simple visualization and by collecting model parameters
  3. Apply weighted multiple linear and polynomial regression models; describe and compare the models with slightly more advanced visualization

Each of the aforementioned list items is a separate blog post.

This post is concerned with #3: applying additional regression models, building on the two already existing ones (see the code in the second post), and applying more sophisticated visualization for comparison.

Code and comments

Considering the previously presented multicollinearity issue and the feature weight ratios, the next model would include only the features ‘W’ and ‘H’. Building a linear and a 2nd degree polynomial model was repeated. Some additional aspects of this consideration (see Input[36] again):

- ‘W’ and ‘H’ are the most significant features for the prediction,

- the other parameters show much more variation, sometimes with differences of orders of magnitude,

- the large variations are not inherent to the soil samples, since these parameters are affected by the agricultural management of the land.

I sliced the original train/test data of features and performed the scaling again:
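Something along these lines (a minimal sketch; x_train and x_test are assumed to be the feature DataFrames from the previous post):

```python
# keep only the two selected features, 'W' and 'H'
x_train_WH = x_train[['W', 'H']]
x_test_WH = x_test[['W', 'H']]
```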

Reset the index for both sets:
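For example (again a sketch, with the assumed names above):

```python
# reset the index so that positional and label-based indexing stay consistent
x_train_WH = x_train_WH.reset_index(drop=True)
x_test_WH = x_test_WH.reset_index(drop=True)
```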

Then comes scaling:
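A sketch of the scaling step; the choice of StandardScaler is an assumption standing in for whichever scaler was used in the previous post:

```python
from sklearn.preprocessing import StandardScaler

# fit the scaler on the training set only, then transform both sets
scaler_WH = StandardScaler()
x_train_WH_scaled = scaler_WH.fit_transform(x_train_WH)
x_test_WH_scaled = scaler_WH.transform(x_test_WH)
```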

Linear regression has been performed:
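Roughly as follows (a sketch; y_train and y_test are assumed to be the liquid limit targets from the previous posts):

```python
from sklearn.linear_model import LinearRegression

# linear model on the reduced, scaled feature set
lin_reg_WH = LinearRegression()
lin_reg_WH.fit(x_train_WH_scaled, y_train)
print(f'R2 on test set: {lin_reg_WH.score(x_test_WH_scaled, y_test):.4f}')
```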

The score got significantly lower, so it seems the features left out have some, but limited, predictive power. At this point I had 3 models, so a comparative figure would have been useful. However, as I dug deeper while exploring this dataset, I got inspired to apply weighted regression as well, so it seemed practical to finish building the models first and apply a comparative figure at the end. It is worth having a brief overview of the models already done and those still to come:

- a linear model considering all features (done),

- a 2nd degree polynomial considering all features (done),

- a linear model considering soluble salt content (‘W’) and humus content (‘H’) (done),

- a 2nd degree polynomial considering soluble salt content (‘W’) and humus content (‘H’) (to go),

- all of the abovementioned models, but with weights (1 / y) applied,

which yields 8 different models to compare.

As for the 2nd degree polynomial considering ‘W’ and ‘H’:
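A sketch along the lines of the earlier polynomial model, built with make_pipeline() (the names are mine):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# generate squares and the interaction term of 'W' and 'H',
# then fit a linear regression on the expanded features
poly_reg_WH = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_reg_WH.fit(x_train_WH_scaled, y_train)
print(f'R2 on test set: {poly_reg_WH.score(x_test_WH_scaled, y_test):.4f}')
```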

Then comes the application of weights, repeating the model building and appending the error and prediction values to the comparison table. The .fit() method is capable of handling weights; I just had to define them and pass them as another argument:
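A sketch of one of the weighted models; the weight definition 1 / y comes from the post, while the variable names (x_train_scaled for the scaled all-feature set, and the rest) are assumptions:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# weights inversely proportional to the target value
weights = 1 / np.asarray(y_train)

# weighted 2nd degree polynomial model on all features, with explicit step names
poly_reg_weighted = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('regressor', LinearRegression()),
])
poly_reg_weighted.fit(x_train_scaled, y_train,
                      regressor__sample_weight=weights)
```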

Here I used the Pipeline() class imported from sklearn.pipeline, adding the weights when fitting. Before, I used the make_pipeline() function. I failed to elaborate on this previously, but either way the trick is to generate new features out of the original ones, including interaction and 2nd degree terms, and then apply linear regression on the new features. Both constructs chain a list of steps that are applied sequentially; the practical difference is that Pipeline() takes explicitly named (name, estimator) steps, while make_pipeline() names the steps automatically. I added the weights at the ‘regressor’ step:
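Concretely, the fit call routes the weights to the named final step (same assumed names as above):

```python
# the '<step name>__<fit parameter>' syntax passes the argument to that step
poly_reg_weighted.fit(x_train_scaled, y_train,
                      regressor__sample_weight=weights)
```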

Once I had learned how to apply the models in general, I could quickly set up a new model in a few cells. I can tell that most of the coding time was spent on manipulating the data and cleaning the dataframe.

All the models are now ready for comparison and for drawing a conclusion. Before visualization, let's compare the scores. For better comprehensibility, I formatted the output as 4-digit floats and added comments:
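A sketch of such a comparison; the model and test set names below are placeholders for the fitted models (the four weighted variants follow the same pattern):

```python
# R2 scores formatted as 4-digit floats
models = [
    ('linear, all features',           lin_reg,     x_test_scaled),
    ('2nd deg. polynomial, all feat.', poly_reg,    x_test_scaled),
    ('linear, W and H',                lin_reg_WH,  x_test_WH_scaled),
    ('2nd deg. polynomial, W and H',   poly_reg_WH, x_test_WH_scaled),
]
for name, model, x in models:
    print(f'{name}: {model.score(x, y_test):.4f}')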

It is time for visualization. I decided to use scatter plots, just as before, but without seaborn's color coding option, which would only distract from the comparison, so I stayed with plain matplotlib styling. I also added an orange line corresponding to the y = x function, so deviations can be observed. The figure consists of 4 x 2 subplots; this setup enables several options for a clear and clean output, e.g. sharing axes and setting the figure size:
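A sketch of how such a 4 x 2 grid can be put together; the prediction arrays and labels are placeholder names:

```python
import matplotlib.pyplot as plt

# one row per model type, one column per weighting option
predictions = [
    [y_pred_lin,     y_pred_lin_w],      # linear, all features
    [y_pred_poly,    y_pred_poly_w],     # 2nd deg. polynomial, all features
    [y_pred_lin_WH,  y_pred_lin_WH_w],   # linear, 'W' and 'H'
    [y_pred_poly_WH, y_pred_poly_WH_w],  # 2nd deg. polynomial, 'W' and 'H'
]
row_labels = ['linear', 'polynomial', 'linear, W+H', 'polynomial, W+H']
col_labels = ['no weights', 'weights (1/y)']

fig, axes = plt.subplots(4, 2, figsize=(10, 16), sharex=True, sharey=True)
for i in range(4):
    for j in range(2):
        ax = axes[i, j]
        ax.scatter(y_test, predictions[i][j], s=10)
        lims = [min(y_test), max(y_test)]
        ax.plot(lims, lims, color='orange')          # y = x reference line
        ax.set_title(f'{row_labels[i]}, {col_labels[j]}')
        if j == 0:
            ax.set_ylabel('predicted liquid limit')  # first subplot of each row only
axes[-1, 0].set_xlabel('measured liquid limit')
axes[-1, 1].set_xlabel('measured liquid limit')
plt.tight_layout()
plt.show()
```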

I admit that at first I tried drawing the grid at once and then the plots one by one. Eventually I applied nested 'for' loops, and the code became much cleaner.

Setting up the nested 'for' loops took no less time, but organizing the inputs, labels and colors in lists, from which the arguments for the plotting function are selected, effectively improves comprehensibility, and that is essential for any project larger than mine.

The ‘y’ axis labels had to be specified only for the first subplot in each row of the figure, since ‘sharey’ is set to True.

As for a conclusion: the weights did not mean much for this dataset overall, since the patterns and scores are pretty much the same. However, switching from the linear model to the 2nd degree polynomial significantly reduced the outliers in both approaches. Interestingly, selecting the two main features and the 2nd degree polynomial model at once made the extreme outliers vanish. The reduced models (features ‘W’ and ‘H’) probably eliminate the variance caused by agricultural soil management, but the neglected features also have some predictive power, since the scores are lower when they are excluded and the displayed pattern of the data aligns in a seemingly less favorable way.

I tried to compare the variances of the models by visualizing the distribution of prediction errors. For that I used the (now deprecated) sns.distplot():
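A sketch of one such plot, here the overlapping version; the error definition (test value minus prediction) and the names are assumptions, and sns.distplot() only emits a deprecation warning in recent seaborn versions:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# prediction errors of the four unweighted models, kde curves drawn on one axis
errors = {
    'linear':          y_test - y_pred_lin,
    'polynomial':      y_test - y_pred_poly,
    'linear, W+H':     y_test - y_pred_lin_WH,
    'polynomial, W+H': y_test - y_pred_poly_WH,
}
for label, err in errors.items():
    sns.distplot(err, hist=False, label=label)
plt.xlabel('prediction error')
plt.legend()
plt.show()
```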

I used different setups for comparison: I made individual graphs in subplots, but I made an overlapping plot as well. I had to draw the overlapping plot three times with different scales, because the error distributions are so close to each other that they are practically indistinguishable from one another. The height of the Gauss-like distribution of errors corresponds to the scores, and zooming in reveals the differences in the tails. I left out the weighted models from this presentation because their added value is negligible. The interquartile ranges from the descriptive statistics are also useful:
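For instance (a sketch, building on the errors dictionary above):

```python
import pandas as pd

# quartiles of the prediction errors; the interquartile range is the 75% - 25% spread
errors_df = pd.DataFrame(errors)
print(errors_df.describe().loc[['25%', '50%', '75%']].round(4))
```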

My last visualization exercise was meant to answer another question that came to my mind:

I had already assigned True or False values for the texture prediction, and I was wondering if I could differentiate between the two classes. Are the distributions of the feature values significantly different?

An answer would help in building a more complex model. The significance of such a conclusion lies in the fact that when I cleaned the data table, I got rid of the "lower than quantification limit" ("<LOQ") entries, whatever their actual value was, and replaced them with exactly the LOQ. If a feature has many entries at or just above the LOQ, this could result in a regression error and could be the source of the observed distribution of scale dependent prediction errors (when I plotted the prediction errors against the test values).

First, I show the code for sns.histplot(), with several altered parameters. Instead of a kernel density estimate (‘kde’), I used a histogram plot, as ‘kde’ assigns false probabilities to values outside the defined range. Another reason for these settings is that I needed a uniform scale and bin size for both groups of data, whereas the default algorithm automatically adjusts these parameters to the observed distribution, which is bound to differ (sample size, min and max values). Maybe a second y axis would overcome the difference in sample size and help comparability, but this time I made the bins uniform by setting the range and width of the bins to be equivalent. For any conclusion, I had to keep in mind that the sample sizes for the plotted ‘2nd degree polynomial without weights’ model are 3426 and 3371 respectively, although practically the same:
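A sketch of what that code could look like; predictions_comparison, x_test_reset and the column name 'Error_of_texture_prediction_poly' come from the earlier steps, everything else is an assumption, and the logic is explained in detail below:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

features = list(x_test_reset.columns)    # the 7 analyzed features, one grid row each
conditions = [True, False]               # texture prediction flag, as assigned earlier
colors = ['tab:blue', 'tab:orange']

# nested list of test values: level 1 = feature, level 2 = True/False condition
xt_hist_input = [
    [x_test_reset.loc[
         predictions_comparison.loc[
             predictions_comparison['Error_of_texture_prediction_poly'] == cond
         ].index.values,
         feat]
     for cond in conditions]
    for feat in features
]

# nested list of bin edges: a shared, uniform bin range and width per feature
xt_hist_bins = [
    np.linspace(min(pair[0].min(), pair[1].min()),
                max(pair[0].max(), pair[1].max()), 41)
    for pair in xt_hist_input
]

# a grid of 7 rows and 1 column; both conditions are drawn on each subplot
fig, axes = plt.subplots(len(features), 1, figsize=(8, 28))
for i, feat in enumerate(features):
    for j, cond in enumerate(conditions):
        sns.histplot(xt_hist_input[i][j], bins=xt_hist_bins[i],
                     color=colors[j], label=str(cond), ax=axes[i])
    axes[i].set_xlabel(feat)
    axes[i].legend(title='prediction flag')
plt.tight_layout()
plt.show()
```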

Note: I drew the histograms while trying to avoid introducing new variables, so I nested .loc[] calls instead. The reason is that up until that point I had already counted at least 60 variables defined by myself and an additional 200 or so defined by the various functions used. In a real-life project, a careful balance must be struck between parameter complexity and code comprehensibility. The logic of my code is:

  1. find the index values in the dataframe where the selected model (the 2nd degree polynomial without weights applied) predicted the texture of the soil sample right or wrong:
    predictions_comparison.loc[‘condition’].index.values, where
    ‘condition’ is predictions_comparison[‘column for selected model’] == True/False, and ‘column for selected model’ is the 2nd degree polynomial without weights applied (‘Error_of_texture_prediction_poly’)
  2. find the test values of the selected feature that correspond to the previously located index values:
    x_test_reset.loc[‘condition’, ‘column of selected feature’], where
    ‘condition’ is the index values and ‘column of selected feature’ is each feature, listed one by one
  3. create a nested list of the extracted values, where level 1 elements correspond to the features and level 2 elements correspond to the extracted values for right and wrong texture predictions, respectively. E.g. to call the ‘H’ values where the texture prediction was wrong, the index is ‘xt_hist_input[2][1]’, since ‘H’ is the 3rd of the level 1 elements and the values for wrong predictions are the 2nd of the level 2 elements of the nested list
  4. create a nested list of bin ranges, where the range values correspond to the levels of the nested list explained in 3., plus a list of colors
  5. draw a grid with 7 rows and one column
  6. display the data points on each subplot with a nested ‘for’ loop, where level 1 corresponds to each subplot and level 2 displays the values for each condition (True/False) on the current subplot

Final remarks

Qualitatively, it seems the most successful model tends to fail its prediction when the values of each parameter (except nitrite-nitrate content and pH) are closer to the limit of quantification than in the case of successful predictions. However, the difference is only moderately distinct, so a quantitative check would be necessary. If deployment of a model were the primary goal, I would recommend introducing more precise analytical methods with lower quantification limits. A better resolution of the feature values would probably decrease the variance of the dataset and increase the predictive power of the model. This was my own project, so that is not viable and I leave the models as they are, but such feedback could also add value to a data gathering process in general.

Some practical aspects of this series of posts:

  1. cleaning the dataset of inappropriate characters, leaving out ‘null’ values and converting entries to the proper datatypes (float, int) is a must, but addressing unjustifiable entries is also necessary. The issue is emphasized further when the number of data points is limited
  2. visualization is a must for deciding whether the preparation of the data is finished. Advanced visualization is also needed for proper inference and reporting, and this may take a lot of practice. Several types of plots are available, and one needs to find the most expressive of them
  3. it is preferable to use loops when possible, not just to reduce the amount of code, but because they enable a neat organization of the parameters and steps applied
  4. a careful balance must be kept between comprehensibility and code complexity (the number of variables defined). It is desirable to set a limit on excess variables, but parameterized solutions may require additional comments
  5. real-life examples with many features may show much variation. It takes much more effort to infer the sources of variation, and other issues such as multicollinearity also have to be addressed in depth

Although the aforementioned aspects of my findings may be a topic for a future follow-up, this is the end of my mini series of posts on soil lab analysis.

You have earned my respect if you have read all three :)

