Created by researchers at Google Brain, TensorFlow is one of the largest open-source libraries for machine learning and data science. It’s an end-to-end platform for both complete beginners and experienced data scientists. The TensorFlow library includes tools, pre-trained models, machine learning guides, and a corpus of open datasets. To help you find the training data you need, this article briefly introduces some of the largest TensorFlow datasets for machine learning. We’ve divided the list into image, video, audio, and text datasets.
TensorFlow Image Datasets
-
CelebA – One of the largest publicly available face image datasets, the Celebrity Faces Attributes Dataset (CelebA) contains over 200,000 images of celebrities. Each image includes 5 facial landmarks and 40 binary attribute annotations.
-
Downsampled Imagenet – This dataset was built for density estimation and generative modeling tasks. It includes just over 1.3 million images of objects, scenes, vehicles, people, and more. The images are available in two resolutions: 32 x 32 and 64 x 64.
-
Lsun – Lsun is a large-scale image dataset created to help train models for scene understanding. The dataset contains over 9 million images divided into scene categories, such as bedroom, classroom, and dining room.
-
Bigearthnet – Bigearthnet is another large-scale dataset, containing satellite images from the Sentinel-2 satellite. Each image covers a 1.2 km x 1.2 km patch of the ground. Each image is annotated with multiple labels drawn from a set of 43 imbalanced classes.
-
Places 365 – As the name suggests, Places 365 contains over 1.8 million images of different places or scenes. Some of the categories include office, pier, and cottage. Places 365 is one of the largest datasets available for scene recognition tasks.
-
Quickdraw Bitmap – The Quickdraw dataset is a collection of drawings made by players of the game Quick, Draw!. It contains 50 million drawings that span 345 categories. This version of the Quickdraw dataset provides the images in grayscale 28 x 28 format.
-
SVHN Cropped – From Stanford University, Street View House Numbers (SVHN) is a TensorFlow dataset built to train digit recognition algorithms. It contains 600,000 examples of real-world image data which have been cropped to 32 x 32 pixels.
-
VGGFace2 – One of the largest face image datasets, VGGFace2 contains images downloaded from Google Image Search. The faces vary in age, pose, and ethnicity. On average, there are 362 images of each subject.
-
COCO – Made by collaborators from Google, FAIR, Caltech, and more, COCO is one of the largest labeled image datasets in the world. It was built for object detection, segmentation, and image captioning tasks. The dataset contains 330,000 images, 200,000 of which are labeled. Within the images are 1.5 million object instances across 80 categories.
-
Open Images Challenge 2019 – Containing around 9 million images, this dataset is one of the largest labeled image datasets available online. The images contain image-level labels, object bounding boxes, and object segmentation masks, as well as visual relationships.
-
Open Images V4 – This dataset is another iteration of the Open Images dataset mentioned above. V4 contains 14.6 million bounding boxes for 600 different object classes. The bounding boxes have been manually drawn by human annotators.
-
AFLW2K3D – This dataset contains 2,000 facial images, each annotated with 3D facial landmarks. It was created to evaluate 3D facial landmark detection models.
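Most of the image datasets above can be loaded through the TensorFlow Datasets (TFDS) package. The sketch below maps some of the names used in this article to the identifiers we believe they are registered under in the TFDS catalog. The dictionary `TFDS_NAMES` and the helper `load_image_dataset` are names made up for this example, and the identifiers are assumptions that should be checked against the current catalog. It also assumes the `tensorflow-datasets` package is installed. Several of these datasets are very large, and some may need manual download steps.

```python
# Article name -> assumed TFDS catalog identifier (verify against the catalog).
TFDS_NAMES = {
    "CelebA": "celeb_a",
    "Lsun": "lsun",
    "Bigearthnet": "bigearthnet",
    "Places 365": "places365_small",
    "Quickdraw Bitmap": "quickdraw_bitmap",
    "SVHN Cropped": "svhn_cropped",
    "VGGFace2": "vgg_face2",
    "COCO": "coco",
    "Open Images V4": "open_images_v4",
    "AFLW2K3D": "aflw2k3d",
}


def load_image_dataset(article_name, split="train"):
    """Load one of the datasets above through TFDS (downloads on first use).

    Returns a (tf.data.Dataset, DatasetInfo) pair.
    """
    # Imported lazily so the name lookup works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(TFDS_NAMES[article_name], split=split, with_info=True)
```

For a first experiment, `load_image_dataset("SVHN Cropped")` is a reasonable choice because it is one of the smaller downloads in the list. The returned `DatasetInfo` object describes the features and splits of the dataset.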
TensorFlow Video Datasets
-
UCF101 – From the University of Central Florida, UCF101 is a video dataset built to train action recognition models. The dataset has 13,320 videos that span 101 action categories.
-
BAIR Robot Pushing – From Berkeley Artificial Intelligence Research, BAIR Robot Pushing contains 44,000 example videos of robot pushing motions.
-
Moving MNIST – This dataset is a variant of the MNIST benchmark dataset. Moving MNIST contains 10,000 videos. Each video shows 2 handwritten digits moving around within a 64 x 64 frame.
-
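EMNIST – Strictly an image dataset rather than a video dataset, Extended MNIST contains handwritten characters and digits derived from NIST Special Database 19 and converted to the same 28 x 28 pixel format as the original MNIST dataset.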
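Video datasets grow quickly once frames are decoded, so it helps to estimate the raw array size before loading one into memory. The sketch below does this for Moving MNIST. The 10,000 videos and the 64 x 64 frame size come from the description above. The 20 frames per video and the single 8-bit channel are assumptions, and `raw_video_bytes` is a helper name made up for this example.

```python
def raw_video_bytes(num_videos, frames, height, width,
                    channels=1, bytes_per_value=1):
    """Uncompressed size in bytes of a video dataset stored as a dense array."""
    return num_videos * frames * height * width * channels * bytes_per_value


# 10,000 videos and 64 x 64 frames come from the article.
# 20 frames per video and one uint8 channel are assumptions.
moving_mnist_shape = (10_000, 20, 64, 64, 1)
moving_mnist_bytes = raw_video_bytes(10_000, 20, 64, 64)

print(moving_mnist_shape, moving_mnist_bytes / 2**20, "MiB")
# prints: (10000, 20, 64, 64, 1) 781.25 MiB
```

Under these assumptions the whole dataset fits in memory on most machines. The same calculation for color video at higher resolutions, such as the clips in UCF101, gives a far larger figure.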
TensorFlow Audio Datasets
-
CREMA-D – Created for emotion recognition tasks, CREMA-D consists of vocal emotional expressions. This dataset contains 7,442 audio clips voiced by 91 actors of varying age, ethnicity, and gender.
-
Librispeech – Librispeech is an audio dataset containing roughly 1,000 hours of read English speech derived from audiobooks from the LibriVox project. It has been used to train both acoustic models and language models.
-
Libritts – This dataset contains around 585 hours of English speech, prepared with the assistance of Google Brain team members. Libritts was originally designed for text-to-speech (TTS) research, but can be used for a variety of voice recognition tasks.
-
TED-LIUM – TED-LIUM is a dataset that consists of over 110 hours of English TED Talks. All talks have been transcribed.
-
VoxCeleb – A large audio dataset built for speaker identification tasks, VoxCeleb contains over 150,000 audio samples from 1,251 speakers.
TensorFlow Text Datasets
-
C4 (Colossal Clean Crawled Corpus) – C4 is a cleaned version of Common Crawl’s web crawl corpus. Common Crawl is an open repository of web page data. It’s available in over 40 languages and spans seven years of data.
-
Civil Comments – This dataset is an archive of over 1.8 million examples of public comments from 50 English-language news sites.
-
IRC Disentanglement – This TensorFlow dataset includes just over 77,000 comments from the Ubuntu IRC Channel. The metadata for each sample includes the message ID and timestamps.
-
Lm1b – Known as the Language Model Benchmark, this dataset contains 1 billion words. It was originally made to measure progress in statistical language modeling.
-
SNLI – The Stanford Natural Language Inference Dataset is a corpus of 570,000 human-written sentence pairs. All of the pairs have been manually labeled for balanced classification.
-
e-SNLI – This dataset is an extension of SNLI, mentioned above. It adds human-written natural language explanations to the original dataset’s 570,000 sentence pairs, which are classified as entailment, contradiction, or neutral.
-
MultiNLI – Modeled after the SNLI dataset, MultiNLI includes 433,000 sentence pairs all annotated with entailment information.
-
Wiki40b – This large-scale dataset includes text from Wikipedia articles in 40 different languages. The data has been cleaned and non-content sections, as well as structured objects, have been removed.
-
Yelp Polarity Reviews – This dataset contains 598,000 highly polar Yelp reviews. They have been extracted from the data included in the Yelp Dataset Challenge 2015.
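The three natural language inference datasets above (SNLI, e-SNLI, and MultiNLI) share the same three classes, which are usually stored as integers. The sketch below shows one way to turn those integers back into readable names and to check class balance. The integer ordering in `NLI_LABELS` is an assumption and should be confirmed against the label feature of the dataset you load. The sentence pairs are hypothetical examples written for this sketch, not taken from any of the datasets.

```python
from collections import Counter

# Assumed integer-to-name mapping; confirm it against the label feature
# of whichever NLI dataset you actually load.
NLI_LABELS = ("entailment", "neutral", "contradiction")


def decode_label(label_id):
    """Return the class name for an integer label, or "unlabeled" if out of range."""
    if 0 <= label_id < len(NLI_LABELS):
        return NLI_LABELS[label_id]
    return "unlabeled"


def label_distribution(examples):
    """Count how many examples fall into each class name."""
    return Counter(decode_label(ex["label"]) for ex in examples)


# Hypothetical sentence pairs written for this sketch, not taken from SNLI.
toy_pairs = [
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "A person is making music.", "label": 0},
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The man is a famous musician.", "label": 1},
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The stage is empty.", "label": 2},
]

for pair in toy_pairs:
    print(decode_label(pair["label"]), "|", pair["hypothesis"])
print(label_distribution(toy_pairs))
```

Running `label_distribution` over a real split is a quick way to verify the balanced classification mentioned for SNLI before training a model.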
While the datasets above are some of the largest and most widely used TensorFlow datasets for machine learning, the TensorFlow library is vast and continuously expanding. Please visit the TensorFlow website for more information about how the platform can help you build your own models.