Datasets

Machine learning revolves around data. Before building an ML model for a specific task, you need to consider carefully what data the model will learn from. Once the relevant data is identified, it must be prepared so that it is suitable for model training. In ML, the term “dataset” refers to the prepared and structured data used to train and evaluate models. Datasets can be private or public.

Enterprises that build ML models often create datasets from their own proprietary data. These datasets are typically closely guarded because they may contain sensitive information. In other words, they are private to the organization that creates them.

Public datasets are openly accessible to anyone. They are mostly created from public data by a range of entities, such as:

  • Research institutions and universities
  • Nonprofit organizations
  • Government agencies
  • Technology companies
  • Individuals
  • Collaborative efforts (academia, private sector, government …)

Public Datasets

Owners of public datasets make them available for general or even commercial use through their own portals, Git repositories, and/or public machine learning hubs such as:

  • Kaggle
  • Hugging Face Datasets

Examples

Databricks Dolly 15K

https://huggingface.co/datasets/databricks/databricks-dolly-15k?row=49

Databricks invited its employees to help build a dataset for QA/Summarization/…
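As a rough illustration of working with a public dataset like this one, the sketch below counts how many records fall into each task category. It assumes the Hugging Face `datasets` library is installed and that each record carries `instruction`, `context`, `response`, and `category` fields. The helper names `load_dolly_train` and `count_categories` are hypothetical, and the sample rows are made up for illustration; they are not taken from the dataset.

```python
from collections import Counter
from typing import Dict, Iterable


def load_dolly_train():
    """Download the train split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the rest of this sketch runs without the library.
    from datasets import load_dataset  # assumption: pip install datasets

    return load_dataset("databricks/databricks-dolly-15k", split="train")


def count_categories(records: Iterable[Dict[str, str]]) -> Counter:
    """Count how many records belong to each task category."""
    return Counter(record["category"] for record in records)


# Made-up rows that mimic the assumed record layout (not real dataset content).
sample = [
    {"instruction": "What is a dataset?", "context": "",
     "response": "Prepared, structured data.", "category": "open_qa"},
    {"instruction": "Summarize the passage.", "context": "A long passage.",
     "response": "A short summary.", "category": "summarization"},
    {"instruction": "Who publishes public datasets?", "context": "",
     "response": "Many kinds of organizations.", "category": "open_qa"},
]

category_counts = count_categories(sample)
print(category_counts.most_common())
```

To run the same count on the real data, pass the result of `load_dolly_train()` to `count_categories` instead of `sample`.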

References

Hugging Face Datasets

Hugging Face Datasets Classes

Code Contest Dataset