AI Projects – What to Do If You Don’t Have Enough Data
Six tips to create and improve the dataset of your ML projects
Let’s be realistic. Machine learning projects almost never have enough data.
Sad but true, huh?
This happens because machines “think” differently than humans. One of the main differences is that they need a lot more data to learn. What does this mean in practice?
It means that computers need to see many samples before they can begin to perform their tasks properly. Once trained, though, they can often do the job faster and more accurately than humans.
Considering this, if you want to increase the probability of success for your ML project, you really should know how to expand its dataset.
We can probably help you with some useful tips, so read on!
Where to find the data you need
You can choose among many strategies to increase your dataset.
Here are some of the main ones.
1. Creating the necessary data on your own
This is the most obvious solution. But be careful: if you start creating data from scratch without knowing exactly what the AI training will need, you may waste time and money.
To prevent this, you should understand two things: what data you need and how much of it.
For the first question, a Data Specification will help you. It is a simple document in which you specify at least the following points:
- what kind of data (images, texts, video, etc.) the project requires
- what classes of data the project requires (if the goal of the project is to identify cats and dogs, it will have at least two classes: (1) cats and (2) dogs)
- what data quality and data format should be used
- what restrictions the project data has
It is also a good idea to include some positive and negative examples in this document.
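The points above can be written down in a structured form so that nothing gets forgotten. The following is only a minimal sketch, assuming you want to keep the specification next to your code: the class name `DataSpecification`, its field names, and the `missing_points` helper are all hypothetical and not part of any standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataSpecification:
    """Hypothetical container for the points a Data Specification should cover."""
    data_kind: str              # e.g. "images", "texts", "video"
    classes: List[str]          # e.g. ["cat", "dog"]
    quality_and_format: str     # free-text description of quality and format
    restrictions: List[str]     # may be an empty list if there are none
    positive_examples: List[str] = field(default_factory=list)
    negative_examples: List[str] = field(default_factory=list)

    def missing_points(self) -> List[str]:
        """Return the names of the required points that are still empty."""
        missing = []
        if not self.data_kind.strip():
            missing.append("data_kind")
        if not self.classes:
            missing.append("classes")
        if not self.quality_and_format.strip():
            missing.append("quality_and_format")
        return missing


# Usage: the cats-and-dogs project from the list above (values are illustrative).
spec = DataSpecification(
    data_kind="images",
    classes=["cat", "dog"],
    quality_and_format="sharp color photos, JPEG",
    restrictions=["one animal per image"],
    positive_examples=["a photo of a single cat on a sofa"],
    negative_examples=["a drawing of a cat"],
)
print(spec.missing_points())
```

An empty list from `missing_points()` means every required point has been filled in; it says nothing about whether the answers are good ones.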
The amount of data you need, by contrast, is less obvious and quite difficult to calculate. But we would like to share our years of experience with you in one table:
This table offers very generic recommendations based on our experience and knowledge. Please keep in mind that your project can easily require different numbers. Still, our advice can be a good starting point!
2. Data augmentation
This option requires that you already have some data. It can be a solution when you have data, but not enough of it.
For example, suppose you have 2 000 images but need more than 100 000. In such a case, data augmentation can be a good option for you.
Data augmentation is a strategy for creating new data samples from existing ones by applying a set of transformations (for images, typically affine transformations such as translation, rotation, and scaling, or simple flips).
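To make the idea concrete, here is a rough sketch in plain Python. It assumes a grayscale image stored as a list of rows of pixel values, and every function name in it is made up for illustration. A real project would more likely rely on an image-processing library, and the random variants produced here may repeat.

```python
import math
import random


def samples_per_original(n_originals, target_size):
    """How many samples each original must yield to reach the target size."""
    return math.ceil(target_size / n_originals)


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate the image clockwise by 90 degrees."""
    return [list(column) for column in zip(*image[::-1])]


def shift_right(image, pixels, fill=0):
    """Translate horizontally (negative = left), padding with `fill`."""
    width = len(image[0])
    pixels = max(-width, min(width, pixels))
    if pixels >= 0:
        return [[fill] * pixels + row[:width - pixels] for row in image]
    return [row[-pixels:] + [fill] * (-pixels) for row in image]


def augment(image, rng):
    """Return a randomly flipped, rotated, and shifted copy of the image."""
    out = image
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    for _ in range(rng.randrange(4)):      # 0, 90, 180, or 270 degrees
        out = rotate_90(out)
    return shift_right(out, rng.randint(-2, 2))


def expand_dataset(samples, target_size, seed=0):
    """Grow a list of (image, label) pairs; each new image keeps its label."""
    rng = random.Random(seed)
    expanded = list(samples)
    i = 0
    while len(expanded) < target_size:
        image, label = samples[i % len(samples)]
        expanded.append((augment(image, rng), label))
        i += 1
    return expanded


# Going from 2 000 to 100 000 images means 50 samples per original image.
print(samples_per_original(2000, 100000))
```

Note that the label travels with each transformed image: a flipped or rotated cat is still a cat, which is what makes this kind of augmentation usable for classification.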
Let me show you a practical case. Imagine that you want to classify cats and dogs. You expect to reach a good level of quality for your classifier, but you have just 2 000 images (1 000 cats and 1 000 dogs). How is it possible to increase this numb