Predicting age of a crab in mud crab farming using Machine Learning

Overview
Crab farming is a major aquaculture activity in India, where consumer demand for crabs is high. Commercial crab farming is a growing business in the coastal areas of India and appears profitable. Mud crab is especially popular because of strong demand in the export market, and commercial-scale mud crab culture is developing fast along the coasts of Andhra Pradesh, Tamil Nadu, Kerala and Karnataka.
About Mud Crabs
Larger Species: The larger species is locally known as ‘green mud crab’. It grows to a maximum size of 22 cm carapace width and 2 kg in weight.
Smaller Species: The smaller species is known as ‘red claw’. It grows to a maximum size of 12.7 cm carapace width and 1.2 kg in weight.
For more about Mud crab farming, please visit this link. To understand more on Mud crab, visit this link.
Business Problem
For a commercial crab farmer, knowing the age of a crab helps decide if and when to harvest it. Beyond a certain age, growth in the crab’s physical characteristics is negligible, so timing the harvest well reduces cost and increases profit.
Data Description
The dataset is publicly available on Kaggle and has the following columns. Please note that the data creator does not state the units of the measurements; the sizes quoted above (cm and kg) give only a general idea of scale.
- Sex : Sex of crab (Male, Female and Intermediate)
- Length : Length of crab
- Diameter : Diameter of crab
- Height : Height of crab
- Weight : Weight of crab
- Shucked Weight : Weight of crab without shell
- Viscera Weight : Weight of the viscera (the internal organs deep inside the abdominal cavity)
- Shell Weight : Weight of shell
- Age : Age of crab
Approach
The business wants to predict the Age of crabs so that it can harvest them at the right time and improve profit. This is a regression problem, and we can follow the steps below to develop a model:
- Perform exploratory data analysis: Observe which features are related to the Age of a mud crab. We have the height, weight and width of the crab, all of which relate to its Age
- Prepare data: Clean the data (missing values, unknown values, encoding) so that it is ready for the algorithm to consume
- Split data: Split your data into training and test sets. I went for an 80–20 split
- Choose an algorithm: Identifying the right algorithm for the problem is a major task, and it mostly doesn’t happen in one go. I went for Linear Regression because the features (height, weight, width, etc.) appear to have a roughly linear relationship with Age
- Predict and evaluate the model using common regression metrics, i.e., MSE, RMSE and MAE.
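To make the three evaluation metrics concrete, here is a minimal sketch in plain Python. The helper name `regression_metrics` and the sample ages are made up for illustration; they are not taken from the notebook or the dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and MAE for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Per-sample prediction errors
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)    # mean squared error
    mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


# Invented ages, for illustration only
actual_age = [9, 6, 11, 8]
predicted_age = [10, 6, 9, 8]
metrics = regression_metrics(actual_age, predicted_age)
print(metrics)
```

RMSE is in the same unit as Age, which usually makes it easier to interpret than MSE, while MAE is less sensitive to a few large errors.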
Results of Exploratory Data Analysis




The graphs above indicate:
- There is a roughly linear relationship between Age and Weight, Height, Diameter, Length, etc. This is expected of living organisms, whose attributes such as height, weight and length increase with age

- Females, Males and Intermediate-sex crabs are almost equally distributed in the dataset, so all three classes are represented about equally

- The graph above shows that almost all features are positively correlated with each other and with the prediction label, i.e. Age
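The class balance and correlation checks described here can be reproduced numerically with pandas. The sketch below uses a small synthetic table as a stand-in for the Kaggle data: the values are randomly generated, and only the column names Sex, Length, Weight and Age follow the dataset description.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Kaggle data (invented values, not real crabs)
rng = np.random.default_rng(0)
n = 300
age = rng.integers(1, 20, size=n).astype(float)
crabs = pd.DataFrame({
    "Sex": rng.choice(["F", "M", "I"], size=n),
    "Length": 0.1 * age + rng.normal(0, 0.2, n),
    "Weight": 2.0 * age + rng.normal(0, 4.0, n),
    "Age": age,
})

# Share of each Sex category; similar shares suggest balanced classes
sex_share = crabs["Sex"].value_counts(normalize=True)

# Pearson correlation of each numeric feature with Age
corr_with_age = crabs.drop(columns="Sex").corr()["Age"].drop("Age")

print(sex_share)
print(corr_with_age)
```

On the real data, the same two calls would give the numbers behind the bar chart and the correlation heatmap.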
Preparing Data for training
- Perform OneHotEncoding for Sex Column
- Separate dependent and independent variables (features and labels)
- Perform train, test split
- Create Model object and train model
- Perform Predictions
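The steps above can be sketched with pandas and scikit-learn as follows. This is not the notebook's code: the data is synthetic (randomly generated stand-in values using a subset of the dataset's column names), and dropping the first dummy column is my own choice to avoid redundant columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Kaggle data (invented values, not real crabs)
rng = np.random.default_rng(42)
n = 500
age = rng.integers(1, 20, size=n).astype(float)
crabs = pd.DataFrame({
    "Sex": rng.choice(["F", "M", "I"], size=n),
    "Length": 0.1 * age + rng.normal(0, 0.2, n),
    "Weight": 2.0 * age + rng.normal(0, 4.0, n),
    "Age": age,
})

# 1. One-hot encode the Sex column (first category dropped as the baseline)
encoded = pd.get_dummies(crabs, columns=["Sex"], drop_first=True, dtype=float)

# 2. Separate independent variables (features) from the dependent one (label)
X = encoded.drop(columns="Age")
y = encoded["Age"]

# 3. 80-20 train/test split, seeded for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Create the model object and train it
model = LinearRegression()
model.fit(X_train, y_train)

# 5. Predict on the held-out data and compute RMSE
y_pred = model.predict(X_test)
rmse = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))
print(f"Test RMSE: {rmse:.2f}")
```

With the real dataset, the synthetic frame would be replaced by the CSV loaded from Kaggle, and the remaining columns would flow through the same steps.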

Outcome
We can now use scikit-learn’s LinearRegression to predict the Age of a crab and evaluate the model. The source code with all details is available in my Kaggle notebook.
Improving Model
One can try the following to improve the model’s predictions:
- Try other algorithms: Decision Trees, Random Forest, etc.
- Perform feature selection: remove features that do not affect the outcome
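Both ideas can be tried with a few lines of scikit-learn. The sketch below is only an illustration on synthetic data: the feature names and the deliberately useless "Noise" column are my own assumptions, and SelectKBest with f_regression is just one of several possible feature selection techniques.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: two informative features and one pure-noise feature
rng = np.random.default_rng(7)
n = 400
age = rng.integers(1, 20, size=n).astype(float)
X = np.column_stack([
    0.1 * age + rng.normal(0, 0.2, n),   # "Length"-like feature
    2.0 * age + rng.normal(0, 4.0, n),   # "Weight"-like feature
    rng.normal(0, 1.0, n),               # noise, unrelated to age
])
feature_names = ["Length", "Weight", "Noise"]

# Compare algorithms with 5-fold cross-validated RMSE
candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
scores = {}
for name, estimator in candidates.items():
    neg_mse = cross_val_score(
        estimator, X, age, cv=5, scoring="neg_mean_squared_error"
    )
    scores[name] = float(np.sqrt(-neg_mse.mean()))

# Keep the two features with the strongest univariate relation to age
selector = SelectKBest(score_func=f_regression, k=2).fit(X, age)
kept = [f for f, keep in zip(feature_names, selector.get_support()) if keep]

print(scores)
print("Selected features:", kept)
```

Cross-validation gives a fairer comparison between algorithms than a single train/test split, since every row is used for both training and evaluation across the folds.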