Some of the best datasets for data science projects are those created for linear regression, predictive analysis, and simple classification tasks. This list collects the best resources for those tasks from our past dataset articles. We’ll also highlight some of the best websites for finding open datasets on your own.
MNIST Datasets
The original MNIST dataset of handwritten digits is considered a benchmark dataset in machine learning because of its small size and simple, well-structured format. It is often used as a test dataset to compare algorithm performance. The dataset contains 70,000 images, split into 60,000 for training and 10,000 for testing. The original dataset can be found here, and other variations of it are listed below.
-
EMNIST is a series of six datasets derived from NIST Special Database 19.
-
The MNIST as JPG dataset is a simple reformatting of the original data into JPG files.
-
3D MNIST is a 3D point cloud version of the original MNIST dataset.
-
Fashion MNIST is a dataset from the online fashion retailer Zalando. It contains 70,000 product images from Zalando’s catalogue, structured in the MNIST format.
-
Skin Cancer MNIST: HAM10000 is a medical image dataset with over 10,000 images of skin lesions.
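The original MNIST files are distributed in the IDX binary format, so reading them without a helper library means parsing a short header first. The following is a minimal sketch, assuming the file has already been decompressed and holds unsigned-byte data; the function name `parse_idx` and the tiny synthetic sample are illustrative only and are not part of any dataset listed above.

```python
import struct


def parse_idx(data: bytes):
    """Parse an IDX byte string into (shape, flat list of ints).

    Assumption: only unsigned-byte data (type code 0x08) is handled.
    """
    # Header: two zero bytes, a type code, and the number of dimensions.
    zero, type_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")

    # Each dimension size is a big-endian 32-bit unsigned integer.
    header_end = 4 + 4 * ndim
    shape = struct.unpack(f">{ndim}I", data[4:header_end])

    # The payload must contain exactly one byte per element.
    count = 1
    for dim in shape:
        count *= dim
    payload = data[header_end:]
    if len(payload) != count:
        raise ValueError("payload size does not match header")
    return shape, list(payload)


# Synthetic stand-in for a real file: two 2x2 "images" with pixel values 0..7.
sample = struct.pack(">HBB3I", 0, 0x08, 3, 2, 2, 2) + bytes(range(8))
shape, pixels = parse_idx(sample)
```

For a real training-image file, the parsed shape should come out as 60,000 images of 28 x 28 pixels, matching the split described above.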
Linear Regression Datasets for Data Science
Linear regression and predictive analytics are among the most common tasks for new data scientists. Below are some of the best datasets to work with for regression tasks or training predictive models.
-
The Cancer Linear Regression dataset consists of information from cancer.gov. The dataset includes statistics about deaths due to cancer in the United States.
-
The CDC Data: Nutrition, Physical Activity, Obesity dataset comes from the CDC’s Behavioral Risk Factor Surveillance System. The author of this dataset used it to study how socioeconomic factors influence obesity.
-
The Medical Insurance Costs dataset comes from "Machine Learning with R", a book by Brett Lantz. The dataset consists of 1,338 rows of data regarding patient information and health insurance charges.
-
The OLS Regression Challenge tasked participants with predicting the mortality rate of cancer in US counties. The dataset contains the following information: death rates, reported cases, US county name, income per county, population, and demographics.
-
Real Estate Price Prediction is a dataset originally compiled for regression analysis, including simple and multiple linear regression, and other predictive tasks. Its fields include the purchase date, property age, location, house price per unit area, and distance to the nearest station.
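For readers new to regression, the core of a one-variable linear fit is small enough to write by hand. The sketch below uses the closed-form least-squares solution in plain Python; the function name `fit_line` and the age and price values are made up for illustration and do not come from any of the datasets above.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example: property age in years vs. price per unit area.
ages = [1, 2, 3, 4, 5]
prices = [48, 46, 44, 42, 40]
slope, intercept = fit_line(ages, prices)
predicted = slope * 6 + intercept  # estimate for a six-year-old property
```

With real data you would typically move on to a library such as scikit-learn, which also handles multiple predictors, but the arithmetic above is what a simple linear fit comes down to.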
Stock Market Datasets
Some people have looked to machine learning algorithms to predict the rise and fall of individual stocks. Even if you have no interest in the stock market, many of the datasets below are great resources to practice building simple regression algorithms or predictive models.
-
The Historical Stock Market Dataset includes historical price and volume information for US-traded stocks and ETFs.
-
From one of the largest clothing retailers in Japan, the Uniqlo Stock Price Prediction dataset contains the company’s historical stock information.
-
Currency Exchange Rates includes the daily currency exchange rates of 51 currencies from 1995 to 2018.
-
Daily Prices for All Cryptocurrencies is a large dataset that includes historical price data for all cryptocurrencies on the market from April 28th, 2013 to November 30th, 2018.
-
Originally prepared for a machine learning class, the News and Stock dataset is great for binary classification tasks. It contains historical news headlines taken from Reddit’s r/worldnews subreddit.
Image Classification Datasets for Data Science
When you’re ready to begin delving into computer vision, image classification tasks are a great place to start. Here are 5 of the best image datasets to help get you started.
-
The Recursion Cellular Image Classification dataset comes from the Recursion 2019 challenge. The competition tasked participants with using biological microscopy data to develop a model that could identify replicates.
-
Uploaded on tensorflow.org, the TensorFlow patch_camelyon Medical Images dataset contains just over 327,000 color images. Each image is 96 x 96 pixels.
-
From MIT, the Indoor Scenes Images dataset contains over 15,000 images of indoor settings and locations. Used for training indoor scene recognition models, all images are in JPEG format. The images have been divided into 67 classes, with at least 100 images in each class.
-
The Intel Image Classification dataset was originally created for an Intel contest. It contains around 25,000 images of natural scenes divided into six categories. The data is divided into folders for testing, training, and prediction.
-
The SUN397 Image Classification Dataset is another dataset available through TensorFlow, containing over 108,000 images divided into 397 categories.
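Before training on any of these image sets, it helps to check how many images each class actually contains. The sketch below assumes the common one-subfolder-per-class layout, which may not match how a given download is organized; the function name `count_images_per_class` and the list of file extensions are illustrative choices.

```python
from collections import Counter
from pathlib import Path

# Assumption: images use one of these extensions (case-insensitive).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root):
    """Count image files under each immediate subfolder of `root`."""
    counts = Counter()
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # skip loose files such as README or label lists
        counts[class_dir.name] = sum(
            1
            for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

Calling `count_images_per_class("indoor_scenes")`, where the path is hypothetical, returns a mapping from class name to image count, which makes it easy to spot classes with too few examples.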
Text Classification Datasets
Aside from image classification, there are also a variety of open datasets for text classification tasks.
-
Recommender Systems Datasets is a repository of datasets used by Julian McAuley, a computer science professor at UCSD. The datasets include text data from various outlets, such as product reviews, social networks, and question/answer data.
-
The Large Movie Review Dataset comes from the Stanford AI Laboratory. This dataset includes 50,000 movie reviews (25,000 for testing and 25,000 for training) perfect for building and evaluating sentiment analysis algorithms.
-
The Twitter US Airline Sentiment Dataset contains tweets classified as positive, negative, and neutral, with around 15,000 tweets about six different airlines.
-
Another dataset using Twitter data, the Hate Speech and Offensive Language Dataset was used to research hate-speech detection. Each tweet is labeled as hate speech, offensive language, or neither. Due to the nature of the study, it’s important to note that this dataset contains text that can be considered racist, sexist, homophobic, or generally offensive.
-
Hate clickbait? You’re not the only one. The Stop Clickbait Dataset was used in the machine learning paper "Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media". This dataset includes 16,000 article headlines pulled from websites like BuzzFeed, The New York Times, Upworthy, and The Guardian. Each headline is labeled as either "clickbait" or "non-clickbait".
Best Places to Find Datasets for Data Science
Still struggling to find the perfect dataset for your data science project? Below is a list of some of the best places to search for datasets on your own.
-
Kaggle
-
Google Dataset Search
-
Ultimate Dataset Aggregator
Build Custom Datasets
It’s possible that you won’t be able to get the data you need through public or open data resources. If you find yourself in this situation, you should look into building your own custom datasets through Lionbridge’s AI training data services.
With a network of data scientists and a state-of-the-art data annotation platform, Lionbridge can help provide you with high-quality ground truth data for a variety of use cases. Learn more about building custom datasets by contacting our sales team.