Machine learning (ML) revolves around data. Before building an ML model for a specific task, you need to think carefully about the data it will learn from. Once the relevant data is identified, it must be prepared so that it is suitable for model training. In ML, the term “dataset” refers to this prepared, structured data, which serves as the foundation for training and evaluating models. Datasets can be private or public.
Enterprises that build ML models often use their own proprietary data to create datasets. These datasets are typically closely guarded because they may contain sensitive data. In other words, they are private to the organization that creates them.
Public datasets are available in the public domain. They are mostly created from public data by a variety of entities, such as:
Owners of public datasets make them available for general or even commercial use through their own portals, Git repositories, and/or public machine learning hubs, such as:
https://huggingface.co/datasets/databricks/databricks-dolly-15k?row=49
Databricks invited its employees to help build a dataset for QA/Summarization/…
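To make the idea of a "prepared and structured" dataset more concrete, here is a minimal sketch of what rows in an instruction-style dataset like this one might look like, and how you could group them by task. The field names (instruction, context, response, category) follow the dataset card for databricks-dolly-15k; the row contents and the two helper functions are hypothetical and written only for illustration.

```python
from collections import Counter

# Hypothetical rows, invented for illustration. Only the field names
# (instruction, context, response, category) come from the dataset card.
records = [
    {
        "instruction": "What is a dataset in machine learning?",
        "context": "",
        "response": "Prepared, structured data used to train and evaluate models.",
        "category": "open_qa",
    },
    {
        "instruction": "Summarize the passage in one sentence.",
        "context": "Enterprises often keep their datasets private because "
                   "they may contain sensitive data.",
        "response": "Enterprise datasets are usually private because of sensitive content.",
        "category": "summarization",
    },
    {
        "instruction": "Who can use a public dataset?",
        "context": "",
        "response": "Anyone, subject to the license chosen by the dataset owner.",
        "category": "open_qa",
    },
]


def count_by_category(rows):
    """Count how many rows belong to each task category."""
    return Counter(row["category"] for row in rows)


def filter_by_category(rows, category):
    """Return only the rows whose category matches the given task."""
    return [row for row in rows if row["category"] == category]


if __name__ == "__main__":
    print(count_by_category(records))
    for row in filter_by_category(records, "summarization"):
        print(row["instruction"], "->", row["response"])
```

If you have the Hugging Face `datasets` library installed, you should be able to swap the hand-written rows for the real data with `load_dataset("databricks/databricks-dolly-15k", split="train")`; the helpers above only assume that each row behaves like a dictionary with a `category` key.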