Introduction to my dataset of choice
When you complete an introductory course on data science, you usually go on to practice what you have learned on your own. Fortunately, plenty of sources offer free datasets that you can use.
This was the case for me as well, but I am fortunate to have a dataset of my own. OK, it is not owned by me, but I lead the team that generated the data. The data comes from a soil and plant analysis laboratory.
Because the data is not public, some parts of the code won't be shown for privacy reasons.
Important aspects of my dataset
- data was gathered over several months in a soil and plant analysis laboratory, from chemical and physical analyses of physically treated soil and plant samples
- part of the pipeline was quality-checking the data before publishing it to customers, so any outlying data could be double-checked to ensure service quality
- the high throughput of samples, the customer-oriented management system and the highly competitive business environment all emphasized the importance of quality checking
- some relationships among parameters had already been established empirically by professionals. The idea was to quantify any relationships present, driven by this empirical hint. Reproducing the educated guess would be reassuring and is my primary goal
- I selected one parameter - by far the most labor-intensive - as the dependent variable
- the lab offers several service packages (sets of parameters) for different purposes, and I selected the two most abundant packages for the analysis. It has to be stated that these packages contain a limited number of parameters
- the selected variables:
1. liquid limit (Arany-féle kötöttségi szám) — dependent variable
2. specific conductivity (μS/cm) — as a representation of soluble salt content
3. calcium carbonate content (weight %) — determined by acidic digestion
4. humus materials (weight %) — determined by acidic digestion
5. nitrite-nitrate content (mg/kg dry matter) — determined by solvent extraction
6. phosphorus content as phosphorus pentoxide equivalent (mg/kg dry matter) — determined by solvent extraction
7. potassium content as potassium oxide equivalent (mg/kg dry matter) — determined by solvent extraction
The empirical hint is that the liquid limit — from which the physical classification of soil samples is derived — is primarily determined by the humus materials and the soluble salt content (represented by specific conductivity), and may be affected by the other parameters. Analyzing the features would make it possible to infer the physical classification of soil samples by predicting the liquid limit.
A brief summary of my workflow
- Preparation of dataset for analysis, description of dataset
- First apply a multiple linear regression model, then a second-degree polynomial model. Describe and compare the models with some simple visualizations and by collecting model parameters
- Apply weighted multiple linear and polynomial regression models; describe and compare the models with slightly more advanced visualizations
Each of the aforementioned list elements is a separate blog post. This post covers #1, the preparation of the chosen dataset.
Coding and comments
I had a pre-cleaned Excel table and had already removed the plant sample data. First, I import the relevant libraries:
I brought the setup from the data science course; however, I won't use statsmodels this time, as regression will be done with sklearn instead. I store the dataset in a pandas dataframe, use numpy for matrix operations, matplotlib for visualization, and seaborn as an extension to matplotlib with better aesthetics and more plotting options.
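Since the original code cells are withheld, here is a minimal sketch of the import setup described above (sklearn only becomes relevant in the later modeling posts):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # let seaborn restyle matplotlib's default aesthetics
```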
I read the Excel file, check the shape, and get the info on the dataset:
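The loading step looked roughly like the sketch below. The real file is private, so a tiny mock table with made-up column names stands in for it, which also lets the inspection calls run:

```python
import pandas as pd

# In the real workflow the table came from Excel, e.g.:
#   df = pd.read_excel("lab_data.xlsx")   # hypothetical filename
df = pd.DataFrame({
    "Metódusok": ["H1", "H2", "H3"],
    "Arany-féle kötöttségi szám": ["42", "<30", "55"],
    "humusz (m/m %)": ["2,1", "3,4", None],
})

print(df.shape)   # (rows, columns)
df.info()         # per-column dtype and non-null count
```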
Here lies the challenge: I have 46 739 x 23 = 1 074 997 dataframe entries, with different but significant counts of null entries, and most columns are stored as objects instead of floats or integers. It was not apparent at the time, but spaces in the column names later broke the code (e.g. when collecting index values), so I also needed to rename all the headers to exclude the spaces.
From experience, I already knew that the data classified under the 'H1' and 'H2' service packages in the 'Metódusok' column is the most abundant, so I removed the irrelevant rows along with the sensitive data. When done, I dropped 'Metódusok' as well:
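In pandas terms, the filtering step can be sketched like this; every column here other than 'Metódusok' is a made-up stand-in for the real (private) ones:

```python
import pandas as pd

df = pd.DataFrame({
    "Metódusok": ["H1", "H2", "H3", "H1"],
    "pH": [7.1, 6.8, 5.9, 7.4],
    "ügyfél": ["A", "B", "C", "D"],  # hypothetical sensitive customer column
})

# Keep only the rows belonging to the two most abundant service packages ...
df = df[df["Metódusok"].isin(["H1", "H2"])]
# ... then drop the sensitive column and finally 'Metódusok' itself.
df = df.drop(columns=["ügyfél", "Metódusok"]).reset_index(drop=True)
print(df)
```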
I checked .head() frequently to keep an eye on the process; here I present only the final dataframe. It was much better as far as the 'NaN' entries were concerned, but there was still a long way to go. After all, I worked with 46 739 x 8 = 373 912 entries:
I got the final shape: 33 996 rows by 8 columns. It was time to format the column headers and index values (I know some lines could have been merged, as the method allows):
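The header cleanup can be sketched as below; the two column names are hypothetical, but they show the kind of embedded spaces that broke my code:

```python
import pandas as pd

df = pd.DataFrame({"kötöttségi szám": [42.0], "humusz tartalom": [2.1]})

# Strip surrounding whitespace and replace inner spaces with underscores,
# so the names survive attribute access and index lookups.
df.columns = df.columns.str.strip().str.replace(" ", "_")
print(df.columns.tolist())
```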
As you can see, the output contains '<' characters and sometimes ',' instead of '.' as the decimal separator. Chemical analyses have a lower limit of quantification (LLOQ), so anything below a defined value is reported as '<LLOQ'. Sometimes a range is defined, so occasionally an upper limit appears as '>ULOQ'. The ',' separator comes from the Hungarian locale, so I needed to replace those:
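A minimal sketch of that string cleanup, using hypothetical raw values of the kind the lab system produced:

```python
import pandas as pd

s = pd.Series(["<0,5", "1,2", ">250", "3,4"])

# Drop the LLOQ/ULOQ markers and switch to '.' decimal separators.
s = (
    s.str.replace("<", "", regex=False)
     .str.replace(">", "", regex=False)
     .str.replace(",", ".", regex=False)
)
print(s.tolist())  # ['0.5', '1.2', '250', '3.4']
```

Note that replacing '<LLOQ' with the LLOQ value itself is a simplification; how censored values should enter a model is a separate decision.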
Finally, I could turn the entries into floats and integers:
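The type conversion in sketch form, again on toy data with made-up column names; for columns that may still contain NaN, `pd.to_numeric` would be the safer choice than `astype(int)`:

```python
import pandas as pd

df = pd.DataFrame({
    "EC": ["120", "340", "85"],       # integer-valued column
    "humusz": ["2.1", "3.4", "0.9"],  # float-valued column
})

df["EC"] = df["EC"].astype(int)
df["humusz"] = df["humusz"].astype(float)
print(df.dtypes)
```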
The data is clean and ready to be exported for analysis and visualization:
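The export is a one-liner; the filename below is a placeholder, and the in-memory round trip just demonstrates that the table survives serialization intact:

```python
import io
import pandas as pd

df = pd.DataFrame({"EC": [120, 340], "humusz": [2.1, 3.4]})

# In the real workflow this went to disk, e.g.:
#   df.to_csv("clean_lab_data.csv", index=False)   # placeholder name
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
restored = pd.read_csv(buf)
print(restored)
```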
So I finally had a clean dataframe with useful entries. However, an important lesson was that a clean table does not necessarily mean sensible data. The dataframe is clean only from a data handling point of view; the unjustifiable values still had to be dealt with before building the model. That belongs to part #2 of my approach, so I will continue with my findings in the next blog post.