Data Checklist for AI projects
Great projects need great data. It’s not our opinion.
It’s a fact.
Let’s see how we can select and prepare it in the best way.
We prepared a list of the main factors that you should consider!
1. Make sure that you have enough data
AI development requires a significant volume of data. The better the results you need, the more data you should have. For example:
- for tabular data, like sales dynamics, you should have at least 1000 rows of uniform information
- for text processing, like news analysis, you should have at least 100 articles or important parts of articles
- for image processing, like cat and dog classification, you should have at least 100 images for each class (100 cats and 100 dogs)
- for video processing, like car tracking, you should have 5000 frames for each environment (5000 frames in daylight, 5000 at night, and so on)
- for sound processing, like “Ok, Google” phrase detection, you should have at least 100 recordings of this phrase.
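The rule-of-thumb numbers above can be turned into a quick automated check. The following is only a minimal sketch: the names `MIN_SAMPLES` and `find_shortfalls` are made up for illustration, and it treats every number in the list as a minimum per group (per class, per environment, and so on).

```python
# Rule-of-thumb minimums taken from the list above (per class / environment).
MIN_SAMPLES = {
    "table": 1000,  # rows
    "text": 100,    # articles
    "image": 100,   # images per class
    "video": 5000,  # frames per environment
    "sound": 100,   # recordings of the phrase
}


def find_shortfalls(data_type, counts):
    """Return the groups whose sample count is below the minimum."""
    minimum = MIN_SAMPLES[data_type]
    return {group: n for group, n in counts.items() if n < minimum}


# Hypothetical dataset: 120 cat images, but only 80 dog images.
print(find_shortfalls("image", {"cat": 120, "dog": 80}))
```

An empty result means every group reaches the rule-of-thumb minimum. It does not guarantee that the data is sufficient for your particular task.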
2. Make sure that your data is balanced
If you want to classify cats versus dogs, a spoken “Hi!” versus “Bye!”, or good versus bad news, make sure that the classes are equal or almost equal in size.
That means 100 cats and 100 dogs, 100 articles with “good” news and 100 articles with “bad” news, and so on.
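A balance check can be sketched in a few lines. The function name `is_balanced` and the tolerance `max_ratio=1.2` are assumptions chosen for illustration. The text only asks for "equal or almost equal" classes, so pick a tolerance that suits your project.

```python
from collections import Counter


def is_balanced(labels, max_ratio=1.2):
    """Return True if the largest class is at most max_ratio times the smallest.

    max_ratio is a hypothetical tolerance, not a value from the checklist.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels given")
    return max(counts.values()) / min(counts.values()) <= max_ratio


labels = ["cat"] * 100 + ["dog"] * 95
print(Counter(labels))      # per-class counts
print(is_balanced(labels))  # 100 / 95 is within the 1.2 tolerance
```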
3. Make sure that your data is rich and varied
It means that you should have 100 pictures of different cats and dogs, taken from different angles.
Or that “Hi!” and “Bye!” are said by different people with different intonations.
4. Make sure that your data is uniform
For example, all your sales records should contain the same fields in the same format, the cat images should contain only cats and no tigers, and everybody should say “Hi!” and not “Salut!”.
If you need to process tigers and “Salut!”, add them to the dataset as new classes: 100 tigers and 100 phrases.
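Uniformity is easy to test once every sample is described by a small record. The sketch below assumes a hypothetical layout in which each record is a dictionary with a `file` and a `label` field. The function name `find_nonuniform` is also illustrative.

```python
def find_nonuniform(records, expected_fields, allowed_labels):
    """Return (index, reason) pairs for records that break uniformity."""
    problems = []
    for i, record in enumerate(records):
        if set(record) != set(expected_fields):
            problems.append((i, "fields"))  # missing or extra fields
        elif record.get("label") not in allowed_labels:
            problems.append((i, "label"))   # unexpected class, e.g. a tiger
    return problems


records = [
    {"file": "a.jpg", "label": "cat"},
    {"file": "b.jpg", "label": "tiger"},  # not in the agreed classes
    {"file": "c.jpg"},                    # label is missing
]
print(find_nonuniform(records, {"file", "label"}, {"cat", "dog"}))
```

If tigers are needed, adding `"tiger"` to `allowed_labels` makes them a regular class, which then needs its own set of examples.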
5. Make sure that your data has the right format
For example, it is not a good idea to store images in PowerPoint or to store texts as images.
The simpler the format, the better.
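One way to catch awkward formats early is to scan file extensions. This is a rough sketch: the `SIMPLE_FORMATS` table is an assumption about what counts as a "simple" format, and `find_awkward_files` is a hypothetical helper name.

```python
from pathlib import Path

# Assumed set of "simple" formats; adjust it to your own project.
SIMPLE_FORMATS = {
    "image": {".jpg", ".jpeg", ".png"},
    "text": {".txt", ".csv", ".json"},
}


def find_awkward_files(filenames, data_type):
    """Return the files whose extension is not a simple format for this data type."""
    allowed = SIMPLE_FORMATS[data_type]
    return [name for name in filenames if Path(name).suffix.lower() not in allowed]


# Images stored inside a PowerPoint file get flagged.
print(find_awkward_files(["cat1.jpg", "cats.pptx", "dog.PNG"], "image"))
```

An extension check says nothing about the file contents. A text stored as a scanned `.png` would still pass the image check, so it complements a manual review rather than replacing it.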
6. Make sure that your data is reasonable
The dataset has to contain data that has predictive power.
For example, if you would like to classify cats and dogs, each image should show a recognizable cat or dog, not just the front right paw or the left ear.
7. Make sure that your data is good enough
Essentially, the data shouldn’t have too much noise or interference.
It means that a picture must have sufficient resolution and clarity to be understood, that a spoken phrase must be understandable to the human ear, etc.
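Resolution is the easiest part of this to check automatically. The sketch below assumes the image sizes are already known as (width, height) pairs. The 224-pixel threshold and the name `find_low_resolution` are purely illustrative, since the right minimum depends on the task. Clarity and audio quality still need a human look or listen.

```python
# Hypothetical minimum size in pixels; choose one that fits your task.
MIN_WIDTH, MIN_HEIGHT = 224, 224


def find_low_resolution(image_sizes):
    """Return the names of images that are smaller than the minimum size."""
    return [
        name
        for name, (width, height) in image_sizes.items()
        if width < MIN_WIDTH or height < MIN_HEIGHT
    ]


sizes = {"cat1.jpg": (640, 480), "cat2.jpg": (100, 300)}
print(find_low_resolution(sizes))
```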