Where Can I Get More Training Data?
There are three main ways to get training data for machine learning projects. The first is to explore free options such as open datasets, online machine learning forums, and dataset search engines. The second is to evaluate your internal options and see whether you can repurpose the data you already have. The third, and often the most efficient, is to outsource training data services to a third party.
In this section, we’ll look more closely at each of these methods, suggest some potential sources of data, and set out the pros and cons so that you can decide which is best for you.
Free Options
There are numerous websites where you can download free datasets online. Here are a few of the most popular:
Google Dataset Search: In early 2020, Google took its dataset search engine out of beta, making around 25 million open datasets from various organizations and research teams searchable in one place.
Kaggle: A subsidiary of Google, Kaggle is a very popular website for data science. It features pages where people can share and download datasets, machine learning guides, and more.
Reddit: There are also many well-moderated forums for machine learning available on Reddit. There are different subreddits for various skill levels where you can get advice from data scientists all over the world, as well as find datasets, tools, and other ML resources. In particular, we recommend r/learnmachinelearning, r/artificial, and r/datascience.
Scrape Web Data
Web scraping is the extraction of data from various public online resources, such as government websites or certain social media platforms. Various web scraping tools can be programmed to search for new data automatically based on your specifications. A couple of good examples of datasets made via web scraping are the Wikipedia Articles Dataset and the Airline Twitter Sentiment Dataset.
Scraping publicly available web data for personal use is generally considered acceptable, although the rules vary by jurisdiction and by website. Scraping data for commercial purposes is more complicated. If you plan to use the data this way, do your research and read the Terms of Service for the site you want to scrape before you begin. You could also reach out to the owner of the site and clarify your position with them.
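As a rough illustration of what a scraping tool does, here is a minimal Python sketch that uses only the standard library. It checks a site's robots.txt rules before fetching and then pulls paragraph text out of a page. The sample robots.txt, the sample HTML, the "my-bot" user agent, and the helper names are all made up for this example, and passing a robots.txt check is not a substitute for reading the site's Terms of Service.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> elements of an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self.paragraphs[-1] += data


def is_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def extract_paragraphs(html):
    """Return the non-empty paragraph texts from an HTML string."""
    extractor = ParagraphExtractor()
    extractor.feed(html)
    return [p.strip() for p in extractor.paragraphs if p.strip()]


# Hypothetical inputs; a real scraper would download these from the site.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"
SAMPLE_HTML = (
    "<html><body><h1>Reviews</h1>"
    "<p>Great flight, friendly crew.</p>"
    "<p>Delayed by two hours.</p>"
    "</body></html>"
)

if is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/reviews"):
    print(extract_paragraphs(SAMPLE_HTML))
```

In a real project you would download the page and the robots.txt file over HTTP and add polite rate limiting; those parts are left out here to keep the sketch self-contained.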
Lionbridge Datasets
To help you get access to the best open datasets, our staff carefully looks through various online machine learning resources and compiles the best datasets for a range of machine learning use cases. Before you start scouring the web, take a look at one of the 300+ datasets we’ve curated on our blog. You can also use our datasets page to search for datasets by field or use case.
Sometimes Free Resources Aren’t Enough
Most of the time, open datasets consist of information that is publicly available through government sites or social media. While an increasing number of useful open datasets are available online, there will be times when free options can’t get you the training data you need.
Luckily, there are other inexpensive ways to create custom datasets for your specific use cases.
Internal Options
Before opting to outsource training data services, you should first check to see what in-house options you have available and if they’ll help you to create the datasets that you need. For example, if you’re building a chatbot to handle online inquiries, you should get in touch with your customer service department to see if they have stored chat logs or email threads you can use to train your model. Of course, data availability depends highly on the problem you are trying to solve with your ML project.
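To make the chatbot example concrete, here is a small Python sketch of how stored chat logs might be turned into training pairs. The log format (a list of messages with "speaker" and "text" fields) and the function name are assumptions made for illustration; your customer service system will almost certainly store its data differently.

```python
def build_training_pairs(chat_log):
    """Pair each customer message with the agent reply that directly follows it."""
    pairs = []
    for current, following in zip(chat_log, chat_log[1:]):
        if current["speaker"] == "customer" and following["speaker"] == "agent":
            pairs.append(
                {
                    "prompt": current["text"].strip(),
                    "response": following["text"].strip(),
                }
            )
    return pairs


# Hypothetical chat log used only to show the shape of the data.
sample_log = [
    {"speaker": "customer", "text": "Where is my order? "},
    {"speaker": "agent", "text": "Let me check that for you."},
    {"speaker": "agent", "text": "It ships tomorrow."},
    {"speaker": "customer", "text": "Thanks!"},
]

for pair in build_training_pairs(sample_log):
    print(pair)
```

Real logs usually also need personal information removed and low-quality exchanges filtered out before they are used for training.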
Create New Data from Current Resources via Data Augmentation: Before you look for datasets elsewhere, you should try to repurpose the data you already have to build a larger dataset. One common way to do this is through data augmentation. For image datasets especially, you can increase your training data with simple transformations such as rotations, flips, contrast adjustments, and other image manipulations. To learn more about how to do this yourself, take a look at our complete guide to data augmentation.
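As a minimal sketch of the idea, the Python below applies three simple augmentations to a tiny grayscale image stored as a list of pixel rows with values from 0 to 255. The function names and the plain-list image format are assumptions chosen to keep the example dependency-free; in practice you would more likely use an image library to do the same operations on real files.

```python
def rotate_90(image):
    """Rotate a 2D image (a list of pixel rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def adjust_contrast(image, factor, midpoint=128):
    """Scale each pixel's distance from the midpoint, clamped to 0..255."""
    return [
        [max(0, min(255, round(midpoint + factor * (pixel - midpoint)))) for pixel in row]
        for row in image
    ]


def augment(image):
    """Return the original image plus three augmented variants."""
    return [
        image,
        rotate_90(image),
        flip_horizontal(image),
        adjust_contrast(image, 1.5),
    ]


# A hypothetical 2x2 grayscale image.
tiny_image = [
    [0, 64],
    [128, 255],
]

for variant in augment(tiny_image):
    print(variant)
```

Each variant keeps the label of the original image, so one labeled example becomes four. Only apply transformations that preserve the label: flipping a photo of a cat is fine, but flipping an image of handwritten text may not be.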
Paid Options
Sometimes free and internal options aren’t able to provide you with machine learning datasets at the scale and quality you require. In these cases, it’s often more efficient to outsource the work to a training data company rather than build a data collection and annotation infrastructure on your own. There are a variety of training data outsourcing options available to you.
Outsourcing Data Collection: One option is to partner with a data collection company. For example, if you are building a voice recognition system and require voice samples from 200 different people, you could simply hire a company to record the audio files for you.
One of the main advantages of this method is that the data collection company will handle all of the project management tasks for you. From finding and training contributors to reviewing the data for accuracy, your project is completely managed by the training data company. All you need to do is provide specific guidelines.
Outsourcing Data Annotation: If you have the data, but don’t have the tools or workforce to annotate the data internally, you can offload all of your annotation tasks by partnering with a data annotation company. These companies can provide the raw data itself, a platform for labeling the data, and a trained workforce to label the data for you. Companies like Lionbridge already have platforms built to collect and annotate data, as well as a large trained workforce that can annotate hundreds of thousands of data points at scale.
Once again, the main advantage of partnering with a data annotation company is that you don’t have to deal with building a data annotation infrastructure from scratch. All you have to do is build specific guidelines and QA protocols for the company to follow.
If you decide to annotate your data yourself, there are a variety of options to consider. In the next section, we’ll review some of the more popular annotation tools on the market to help you make an educated choice.