Introduction and reminder
This is the second in a series of three blog posts on analyzing data originating from a soil testing laboratory. A quick overview of the goals set in the first post:
professionals have empirically suggested that the liquid limit of a soil sample, from which the physical classification of soil samples is derived, is primarily determined by the humus content and the soluble salt content (represented by specific conductivity), and may be affected by the other parameters. Analyzing these features would make it possible to infer the physical classification of soil samples by predicting the liquid limit.
The following workflow has been planned:
- Preparation of the dataset for analysis, description of the dataset
- First apply a multiple linear regression model, then a 2nd degree polynomial model; describe and compare the models with some simple visualizations and by collecting the model parameters
- Apply weighted multiple linear and polynomial regression models; describe and compare the models with slightly more advanced visualization
Each of the aforementioned list items corresponds to a separate blog post.
This post covers the second item: applying two different regression models, collecting the model parameters and drawing simple figures for visualization and comparison.
Code and comments
A clean dataframe has been prepared, which is free of ‘null’ entries, and the number of features has been reduced by keeping only the sample data associated with the two most abundant lab service packages. The original dataframe consisted of 46 739 x 23 = 1 074 997 entries, which have been reduced to 33 996 x 8 = 271 968 entries.
I made a distribution plot, and it became apparent that I had to deal with unjustifiable data points which seemed nonsensical by experience. There is a deprecation warning, because I first used sns.histplot() for ‘pH’ and then sns.distplot() for ‘KA’. The latter quickly draws a KDE (kernel density estimate) curve in addition to the bars, but my future code will prefer sns.histplot() (if not, I keep warnings on as a reminder):
The KA values contain extreme outliers. I cannot check at the moment how these values were generated. They are a minority, so I treat them as outliers and slice them out of the dataframe. An important note: two methods are available for the lab analysis, an automated and a manual one. Both methods classify the soil into several classes with specific limits, but even the highest limit is less than 100, so values of several hundred are hardly interpretable.
With the .unique() method I can sort and present the extreme values:
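A small sketch of the step, on a toy KA column (the extreme values below are made up for illustration). Note that .unique() preserves the order of appearance, so the result has to be sorted explicitly:

```python
import numpy as np
import pandas as pd

# Toy KA column containing a few illustrative extreme values
df = pd.DataFrame({"KA": [38, 41, 55, 60, 167, 412, 38, 799]})

# .unique() keeps the order of first appearance; np.sort orders the result
extremes = np.sort(df["KA"].unique())
print(extremes)  # the extreme outliers end up at the tail
```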
I set a limit at KA = 167, so I drop the rows which contain higher KA values:
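One way to sketch the row-dropping step is boolean masking, shown here on toy data (not the author's actual code):

```python
import pandas as pd

df = pd.DataFrame({"KA": [38, 41, 167, 412, 799],
                   "pH": [6.5, 7.1, 6.9, 7.0, 6.8]})

# Keep only rows at or below the chosen cutoff; rows above it are dropped
df_clean = df[df["KA"] <= 167].reset_index(drop=True)
print(len(df_clean))
```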
Let’s try sns.distplot() again:
Using the KDE feature in a histogram may turn out to be disadvantageous for datasets where the data points have a limited range and the mode happens to lie at or near that limit, because the smoothed curve falsely attributes likelihood to values outside the defined range. Since the function can be set up to display only the KDE, without bin frequencies, for a cleaner visualization, a misinterpretation by a lay audience is quite likely in general.
To check the correlation between the suspected most impactful features and the liquid limit, I used sns.scatterplot(). Its ‘hue’ parameter adds a second variable, which determines the color of the data points:
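A sketch of the hue-colored scatter plot on synthetic data; the column names ‘H’ (humus) and ‘W’ are assumptions based on the features discussed later in the post:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "H": rng.uniform(0.5, 4.0, 500),   # assumed humus-content column
    "W": rng.uniform(0.0, 0.15, 500),  # assumed second feature column
})
# Synthetic target loosely tied to both features
df["KA"] = 25 + 6 * df["H"] + 80 * df["W"] + rng.normal(0, 3, 500)

# 'hue' colors each point by a third variable, here the target itself
ax = sns.scatterplot(data=df, x="H", y="W", hue="KA")
plt.savefig("scatter_hue.png")
```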
Qualitatively, a relationship between the features and ‘KA’ is present. The data are scattered, but darker tones can be identified more towards the top and the right corner. Unfortunately, the bulk of the data is crowded into a limited region, and no information on density is displayed. Time to apply a regression analysis, fingers crossed that the model explains the scatter. The input features for the first analysis are non-standardized, just for practice and for comparability with the standardized analysis. I extract the column headers (except ‘KA’) into a list for a parameterized solution, which I will practice and prefer later on:
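The parameterized feature list and the first fit can be sketched like this, again on synthetic data (the real dataframe has seven feature columns; only a few assumed names are used here):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.uniform(0, 1, (300, 4)),
                  columns=["pH", "H", "W", "NO3"])  # assumed feature names
df["KA"] = 30 + 10 * df["H"] + 5 * df["W"] + rng.normal(0, 1, 300)

# Parameterized solution: every column except the target
features = [c for c in df.columns if c != "KA"]

model = LinearRegression()
model.fit(df[features], df["KA"])
print(round(model.score(df[features], df["KA"]), 3))  # R² on the training data
```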
The score for the regression:
It is a bit lower than I anticipated, but it still has explanatory power. I extracted the coefficients and the intercept into a dataframe formatted to 4 significant figures:
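One way to collect the fitted parameters into a dataframe rounded to 4 significant figures (a sketch on synthetic data, not the post's actual table):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.uniform(0, 1, (200, 3)), columns=["pH", "H", "W"])
y = 30 + 10 * X["H"] + 5 * X["W"] + rng.normal(0, 1, 200)

model = LinearRegression().fit(X, y)

# Intercept and coefficients side by side, to 4 significant figures
params = pd.DataFrame({
    "parameter": ["intercept"] + list(X.columns),
    "value": [model.intercept_] + list(model.coef_),
})
params["value"] = params["value"].map(lambda v: float(f"{v:.4g}"))
print(params)
```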
Considering the coefficients derived from the non-standardized features, it is hard to justify the preliminary assumption of ‘W’ as an impactful feature. I standardized the features and repeated the multiple linear regression:
Note: the scaling yielded an array instead of a dataframe. Creating a separate dataframe for the new standardized parameters and merging the two dataframes together:
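The scaling-and-merging step can be sketched as follows; StandardScaler indeed returns a plain ndarray, which is wrapped back into a dataframe and joined to the original (column names and the `_std` suffix are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.uniform(0, 1, (100, 2)), columns=["H", "W"])
df["KA"] = 30 + 10 * df["H"] + rng.normal(0, 1, 100)

features = ["H", "W"]
scaled = StandardScaler().fit_transform(df[features])  # plain ndarray, no headers

# Rebuild a dataframe with suffixed headers, then merge the two side by side
df_scaled = pd.DataFrame(scaled,
                         columns=[f"{c}_std" for c in features],
                         index=df.index)
df = pd.concat([df, df_scaled], axis=1)
print(df.columns.tolist())
```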
The standardized coefficients finally justify the preliminary assumption: ‘W’ and ‘H’ truly are the most impactful features. Important note: no test split has been applied, so the model has not been tested. I build the model again with a test split applied, and add the parameters to the dataframe for comparison:
The order of the rows has been randomized; the training data is 80% of the original dataframe and 20% is split off for testing. The train and test dataframes have to be scaled again:
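The split-and-rescale sequence can be sketched like this on synthetic data. Note that the scaler is fitted on the training set only and then applied to both splits, which avoids leaking test-set statistics into training:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.uniform(0, 1, (500, 3)), columns=["pH", "H", "W"])
df["KA"] = 30 + 10 * df["H"] + 5 * df["W"] + rng.normal(0, 1, 500)

# Rows are shuffled by default; 20% goes to the test set
X_train, X_test, y_train, y_test = train_test_split(
    df[["pH", "H", "W"]], df["KA"], test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then transform both splits
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

model = LinearRegression().fit(X_train_s, y_train)
print(round(model.score(X_test_s, y_test), 3))
```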
I get the parameters and add them to the comparison dataframe:
The intercept practically remained the same, but the coefficients shifted slightly, so the test split definitely made an impact. I compared the score values:
Same dataset and same optimization algorithm, so the first two models perform equally. Applying the split lowered the score, but strengthened the model by testing it on an independent sample. I set up a new dataframe documenting the actual test values and the predictions of the different models. However, in the case of the standardized model with a test split, the index values have been completely mixed up, since the test split randomized the order. When I call the .predict() method on the regression model and extract the result into the new dataframe, the indices are reset, so I need to reset the indices of the test values as well:
I calculate the error of prediction numerically for later use:
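The index pitfall and the error calculation can be sketched together: y_test keeps its shuffled original index after the split, so it is reset before being placed next to the prediction array (synthetic data; column names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
df = pd.DataFrame({"H": rng.uniform(0, 4, 300)})
df["KA"] = 30 + 8 * df["H"] + rng.normal(0, 2, 300)

X_train, X_test, y_train, y_test = train_test_split(
    df[["H"]], df["KA"], test_size=0.2, random_state=1)
model = LinearRegression().fit(X_train, y_train)

# y_test still carries the shuffled index; reset it so rows line up
results = pd.DataFrame({
    "KA_test": y_test.reset_index(drop=True),
    "KA_pred": model.predict(X_test),  # plain array, implicitly indexed 0..n-1
})
results["error"] = results["KA_pred"] - results["KA_test"]
print(results.head())
```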
I will keep only the standardized model with a test split, because I will try a multiple polynomial regression for comparison. Before advancing to polynomials, I need to classify the soil samples in accordance with the lab tests. The significance lies in the fact that customers are primarily interested in the classification as a qualification, rather than in the assignment of a numerical value; however, the accepted error for the numerical prediction is ± 2 units of ‘KA’, which seems a strict limit in light of the calculated results. I define the list of conditions and a value for each condition, then use np.select() to assign the class for each row in a new column (the Hungarian class names are as follows: ‘Durva homok’: coarse sand, ‘Homok’: sand, ‘Homokos vályog’: sandy loam, ‘Vályog’: loam, ‘Agyagos vályog’: clay loam, ‘Agyag’: clay, ‘Nehéz agyag’: fine clay):
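The np.select() pattern can be sketched as below. The KA class limits used here are illustrative assumptions, not the post's actual thresholds, so check the relevant Hungarian standard before reusing them:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"KA_test": [22, 28, 33, 40, 46, 55, 68]})

# Illustrative class limits (assumed, ascending, checked in order)
conditions = [
    df["KA_test"] < 25,
    df["KA_test"] < 30,
    df["KA_test"] < 38,
    df["KA_test"] < 42,
    df["KA_test"] < 50,
    df["KA_test"] < 60,
]
choices = ["Durva homok", "Homok", "Homokos vályog",
           "Vályog", "Agyagos vályog", "Agyag"]

# np.select picks the first matching condition; the default catches the rest
df["texture"] = np.select(conditions, choices, default="Nehéz agyag")
print(df)
```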
I change the column order, so that the freshly created and appended column is placed right after the test values:
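Reordering amounts to re-indexing the dataframe with an explicit column list (a sketch; the column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"KA_test": [40.0], "KA_pred": [41.2],
                   "error": [1.2], "texture": ["Vályog"]})

# Selecting columns in the desired order returns a reordered dataframe
df = df[["KA_test", "texture", "KA_pred", "error"]]
print(df.columns.tolist())
```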
I repeat the aforementioned steps to classify the samples according to the predicted texture, and I will add the same results whenever a new relevant model is made. In addition to:
- predicted value (float),
- error of prediction (float),
- predicted texture (string, assigned class)
I also add for each future model:
- error of predicted texture (Boolean)
Counting the True/False values is thus a different way to qualify the model, and it may be useful as a condition for comparing the models to the test values later on:
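The Boolean column can be sketched with np.where, as in the post; a plain element-wise comparison gives the same result without numpy (toy data, assumed column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"texture_test": ["Homok", "Vályog", "Agyag"],
                   "texture_pred": ["Homok", "Agyag", "Agyag"]})

# True wherever the predicted class matches the lab class
df["texture_ok"] = np.where(df["texture_test"] == df["texture_pred"], True, False)
# equivalent without numpy:
# df["texture_ok"] = df["texture_test"] == df["texture_pred"]
print(int(df["texture_ok"].sum()))  # number of correct class predictions
```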
I used numpy’s np.where, which returned a warning that, frankly, I cannot understand at the moment, but it is clear that the value assignment could also be done with pandas’ .loc[]. I will keep that in mind. I also changed the column headers to exclude spaces, which I forgot to do while setting up the dataframe and which turned out to be a mistake (without output):
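The header cleanup can be sketched as a simple comprehension over the columns (toy headers, not the post's actual ones):

```python
import pandas as pd

df = pd.DataFrame({"KA test": [40.0], "KA pred": [41.2]})

# Replace spaces with underscores so the headers work with attribute access and .query()
df.columns = [c.replace(" ", "_") for c in df.columns]
print(df.columns.tolist())
```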
OK, I had 6797 predictions; it’s time to do some plotting. I called on matplotlib/seaborn for help again, this time setting the figure size manually and putting some limits on the axes. The color coding depends on the deviation from the x = y correspondence:
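A sketch of such a prediction-versus-test plot on synthetic data, with a manual figure size, axis limits and deviation-based coloring (column names are assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(6)
test = rng.uniform(25, 70, 400)
pred = test + rng.normal(0, 4, 400)  # synthetic predictions around the ideal line
df = pd.DataFrame({"KA_test": test, "KA_pred": pred,
                   "deviation": np.abs(pred - test)})

fig, ax = plt.subplots(figsize=(7, 7))
sns.scatterplot(data=df, x="KA_test", y="KA_pred", hue="deviation", ax=ax)
ax.plot([25, 70], [25, 70], color="black")  # the ideal x = y correspondence
ax.set_xlim(25, 70)
ax.set_ylim(25, 70)
fig.savefig("pred_vs_test.png")
```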
Some extreme predictions have clearly appeared, but considering the number of points and the score of the model, this dataset turned out to be acceptable for demonstration. Let’s see another plot, representing the deviation from the test value as a function of the test value itself:
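A sketch of the residual plot on synthetic data; a scale-dependent trend is built into the fake errors here purely for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(7)
test = rng.uniform(25, 70, 400)
# Synthetic errors with a deliberate scale-dependent trend
error = 0.3 * (test - 45) + rng.normal(0, 2, 400)

ax = sns.scatterplot(x=test, y=error)
ax.axhline(0, color="black")  # zero-error reference line
ax.set_xlabel("KA (test)")
ax.set_ylabel("prediction error")
plt.savefig("error_vs_test.png")
```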
It seems there is a clear scale-dependent trend in the error of prediction. At the moment I don’t have the experience to come to a definite conclusion, but I think at least one of the prerequisites of linear regression has not been met. Multicollinearity should be among those criteria, since the soluble salt content definitely correlates with the extractable forms of phosphorus, potassium and nitrite-nitrate; basically with every other parameter, since it is derived from a conductivity measurement, and every parameter contributing to a higher ion concentration has an effect on that measurement. This has been reinforced statistically: although 5 out of the 7 features got seemingly insignificant coefficients, the features give an extremely low p-value for the F-test of linear dependency (tested as significant), and that is usually a hint of multicollinearity:
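One quick numerical check for multicollinearity, not shown in the post, is the variance inflation factor: regress each feature on the others and compute 1/(1 − R²). The sketch below fakes a salt column that drives a nutrient column (all names and values are assumptions); a VIF well above ~5 flags a problematic feature:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
salt = rng.uniform(0, 1, 300)
X = pd.DataFrame({
    "salt": salt,
    "K2O": 0.9 * salt + rng.normal(0, 0.05, 300),  # strongly tied to salt
    "pH": rng.uniform(5, 8, 300),                  # independent control feature
})

# VIF per feature: regress it on the remaining features, VIF = 1 / (1 - R²)
vif = {}
for col in X.columns:
    others = X.drop(columns=col)
    r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
    vif[col] = 1 / (1 - r2)
print({k: round(v, 1) for k, v in vif.items()})
```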
I should probably deal with the issue by applying principal component analysis (PCA), but since this is for demonstration purposes, I decided to go on building the models with PCA excluded. I leave the issue for a future rehearsal. Instead, I built a 2nd degree polynomial regression model on the dataset, a hint I could derive from plotting the features in Tableau Public. I have already justified that ‘W’ and ‘H’ are the most significant features, and a 2nd degree polynomial seemed a better fit (especially for ‘H’).
There are multiple ways to conduct a polynomial regression; as you will see, I applied two different approaches, with sources included. I don’t want to get ahead of things, so the second method will be shown later on. As for the first method, I have to call some relevant functions. I should mention that the source originally applied LASSO regularization in a machine learning algorithm, where the degree of the polynomial function was also a parameter for optimization; it turned out to loop infinitely on my dataset.
Instead, I used Ridge regression regularization with the polynomial fixed at 2nd degree, but kept the original code of my source converted to comments. The minimization function of Ridge regression contains the model weight values and sets a limit on them, represented by the alpha value, thus reducing overfitting and addressing multicollinearity, as explained here.
Important note: the alpha value is left at its default of 1, so this is not a machine learning method, since no optimization is performed on the alpha value. However, for a future rehearsal the code is prepared for upgrading into a proper machine learning workflow.
A new pipeline has been made, in which the listed functions are performed consecutively. More steps could have been included, e.g. standard scaling; in that case a pipeline would make even more sense, combining multiple functions into a single piece of code and avoiding redundancy. In my case the standardized features were already there from the linear regression (please see the source of the code here):
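A minimal sketch of such a pipeline on synthetic data: PolynomialFeatures fixed at degree 2 feeding a Ridge regressor with the default alpha of 1.0, so no hyperparameter tuning takes place (the target below is deliberately curved in ‘H’ for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(9)
df = pd.DataFrame({"H": rng.uniform(0, 4, 500), "W": rng.uniform(0, 1, 500)})
# Synthetic target, quadratic in H so the 2nd degree model has something to find
df["KA"] = 28 + 4 * df["H"] ** 2 + 6 * df["W"] + rng.normal(0, 2, 500)

X_train, X_test, y_train, y_test = train_test_split(
    df[["H", "W"]], df["KA"], test_size=0.2, random_state=0)

# Degree fixed at 2, alpha left at the sklearn default of 1.0 (no optimization)
poly_model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
poly_model.fit(X_train, y_train)
print(round(poly_model.score(X_test, y_test), 3))
```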
It seems the model improved significantly, although the score is still lower than I had hoped for at the beginning. I repeated the sequence of calculating the predictions and errors, as well as the texture classification, extending the dataframe:
It seems some of the extreme prediction values are gone, a definite improvement. The trend in the errors sadly remained, but turned into a more 2nd-order-like figure:
I counted the True and False values of the texture predictions to qualify the two models. The share of correct predictions increased by approx. 10%:
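The counting step can be sketched with value_counts on the Boolean columns (toy values; the roughly 10% improvement below is staged for illustration, not the post's real tally):

```python
import pandas as pd

df = pd.DataFrame({
    "linear_ok": [True, False, True, False, False],  # toy linear-model hits
    "poly_ok":   [True, True,  True, False, True],   # toy polynomial-model hits
})

# value_counts tallies correct vs. incorrect class predictions per model
print(df["linear_ok"].value_counts())
print(df["poly_ok"].value_counts())
```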
Up to this point I have deployed and compared two different models for predicting soil texture from the selected features. Both models have limited predictive power, so I tried to refine them by adding weights and selecting the two most significant features for building the models; this is the topic of my next blog post. The third post will also include slightly more advanced plotting, displaying multiple plots at once for better comparability of the models.