List of datasets for machine-learning research

These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms, computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine-learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality unlabeled datasets for unsupervised learning can also be difficult and costly to produce.

List of sorting used for datasets

Data portals are classified by their type of license. Portals whose data is released under open licenses are known as open data portals, and they are used by many government organizations and academic institutions.

List of open data portals

List of portals suitable for multiple types of applications

Some data portals list a wide variety of dataset subtypes covering many machine learning applications.

List of portals suitable for a specific subtype of applications

Data portals suited to a specific subtype of machine learning application are listed in the following sections.

Image data

Text data

These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.

• Reviews
• News articles
• Messages
• Twitter and tweets
• Dialogues
• Legal
• Other text

Sound data

These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.

• Speech
• Music
• Other sounds

Signal data

Datasets containing electrical signal information that requires signal processing before further analysis.

• Electrical
• Motion-tracking
• Other signals

Chemical data

Datasets from physical systems.

Chemical

Reactions with transition states (TS)

OpenReACT-CHON-EFH

OpenReACT-CHON-EFH (Open Reaction Dataset of Atomic ConfiguraTions comprising C, H, O and N with Energies, Forces and Hessians) is a 2025 open-access benchmark for machine-learning interatomic potentials.

• **RTP set** – 35,087 stationary-point geometries (reactant, transition state and product) drawn from 11,961 elementary reactions, each labeled with density-functional energies, atomic forces and full Hessian matrices at the ωB97X-D/6-31G(d) level.
• **IRC set** – 34,248 structures along 600 minimum-energy reaction paths, used to test extrapolation beyond trained stationary points.
• **NMS set** – 62,527 off-equilibrium geometries generated by normal-mode sampling to probe model robustness under thermal perturbations.

The collection underpins the study "Does Hessian Data Improve the Performance of Machine Learning Potentials?" and was used to train and benchmark the machine-learning interatomic potentials reported therein. The dataset itself is distributed under a CC licence via Figshare.
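The three kinds of labels mentioned for this dataset (energies, forces and Hessians) are related by differentiation: forces are the negative gradient of the energy with respect to atomic positions, and the Hessian collects the second derivatives. The following is a minimal sketch of that relationship, assuming a hypothetical one-dimensional Morse potential in arbitrary units rather than any values from the dataset; the function names and parameters are illustrative only.

```python
import math

# Hypothetical toy potential (NOT taken from the dataset): a 1D Morse curve
# E(r) = D * (1 - exp(-A * (r - R0)))**2, in arbitrary units.
D, A, R0 = 4.0, 1.5, 1.0


def energy(r: float) -> float:
    """Energy label for a single coordinate r."""
    return D * (1.0 - math.exp(-A * (r - R0))) ** 2


def force(r: float) -> float:
    """Force label: the negative first derivative of the energy."""
    x = math.exp(-A * (r - R0))
    return -2.0 * D * A * x * (1.0 - x)


def hessian(r: float) -> float:
    """Hessian label: the second derivative of the energy (1x1 here)."""
    x = math.exp(-A * (r - R0))
    return 2.0 * D * A * A * x * (2.0 * x - 1.0)


def fd_force(r: float, h: float = 1e-5) -> float:
    """Central finite-difference estimate of the force from energies."""
    return -(energy(r + h) - energy(r - h)) / (2.0 * h)


def fd_hessian(r: float, h: float = 1e-4) -> float:
    """Central finite-difference estimate of the Hessian from energies."""
    return (energy(r + h) - 2.0 * energy(r) + energy(r - h)) / (h * h)


if __name__ == "__main__":
    r = 1.3  # an off-equilibrium geometry
    print("force  :", force(r), "vs finite difference", fd_force(r))
    print("hessian:", hessian(r), "vs finite difference", fd_hessian(r))
```

At the minimum `r = R0` the force vanishes and the Hessian is positive, which is the one-dimensional analogue of a stationary point. In a real molecule the force is a vector with 3N components and the Hessian is a 3N by 3N matrix for N atoms.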

Physical data

Datasets from physical systems.

• High-energy physics
• Systems
• Astronomy
• Earth science
• Other physical

Biological data

Datasets from biological systems.

• Human
• Animal
• Fungi
• Plant
• Microbe
• Drug discovery

Anomaly data

Question answering data

This section includes datasets that deal with structured data.

Dialog or instruction prompted data

This section includes datasets that contain multi-turn text with at least two actors, a "user" and an "agent". The user makes requests of the agent, which carries them out.

Cybersecurity

Climate and sustainability

Multivariate data

• Financial
• Weather
• Census
• Transit
• Internet
• Games
• Other multivariate

Curated repositories of datasets

As datasets come in myriad formats and can sometimes be difficult to use, considerable work has gone into curating datasets and standardizing their format to make them easier to use for machine learning research.

• OpenML: Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
• PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
• Metatext NLP (https://metatext.io/datasets): Community-maintained web repository containing nearly 1,000 benchmark datasets and counting. Covers many tasks, from classification to QA, and various languages, including English, Portuguese, and Arabic.
• Appen: Off The Shelf and Open Source Datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.
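As an illustration of what access through a Python API can look like for two of the repositories listed above, here is a minimal sketch. It assumes the third-party packages `pmlb` and `scikit-learn` are installed and that network access is available. The OpenML example goes through scikit-learn's `fetch_openml` helper rather than OpenML's own client, and the dataset names `mushroom` and `iris` are example choices, not recommendations from the list. The helper names `load_from_pmlb`, `load_from_openml` and `describe` are hypothetical.

```python
def load_from_pmlb(name: str = "mushroom"):
    """Fetch a PMLB dataset as (features, labels).

    Assumes `pip install pmlb`; the import is deferred so that this module
    can be loaded even when the package is missing.
    """
    from pmlb import fetch_data

    return fetch_data(name, return_X_y=True)


def load_from_openml(name: str = "iris", version: int = 1):
    """Fetch an OpenML dataset through scikit-learn as (features, labels).

    Assumes `pip install scikit-learn pandas`.
    """
    from sklearn.datasets import fetch_openml

    bunch = fetch_openml(name=name, version=version, as_frame=True)
    return bunch.data, bunch.target


def describe(features, labels) -> str:
    """Summarize a dataset by row count and number of distinct labels."""
    return f"{len(features)} rows, {len(set(labels))} distinct labels"


if __name__ == "__main__":
    # Both calls download data on first use.
    for loader in (load_from_pmlb, load_from_openml):
        X, y = loader()
        print(loader.__name__, "->", describe(X, y))
```

Because both loaders return features and labels in the same shape, the same downstream evaluation code can be reused across repositories, which is the main practical benefit of the standardized formats described above.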

Source: Wikipedia ↗
