Datasets
Machine Learning revolves around data. Before constructing an ML model for a specific task, careful consideration of the data is essential. Once the relevant data is identified, it must be prepared so that it is well suited for ML model training. In ML, the term “dataset” refers to this prepared, structured data that serves as the foundation for training and evaluating models. Datasets can be private or public.
Enterprises that build ML models create datasets from their own proprietary data. These datasets are typically closely guarded because they may contain sensitive information; in other words, they are private to the organization that creates them.
Public datasets are available in the public domain. They are mostly built from public data by a variety of entities, such as:
- Research institutions & universities
- Non-profit organizations
- Government agencies
- Technology companies
- Individuals
- Collaborative efforts (Academia, Private sector, Government …)
Public Datasets
Owners of public datasets make them available for general or even commercial use via their own portals, Git repositories, and/or public machine-learning hubs such as:
- Kaggle
- Hugging Face Datasets
Databricks Dolly 15K
https://huggingface.co/datasets/databricks/databricks-dolly-15k?row=49
Databricks invited its employees to help build a dataset for QA, summarization, and other tasks.
Fine-tuning datasets
- SQL coder
- OpenAI sample datasets: example datasets provided by OpenAI as samples.
- Marketing emails: email templates for product marketing.
- Patient–doctor conversations
- Code Contest Dataset
- Alpaca conversational: Alpaca dataset converted to JSON message format.
- Google C4 dataset
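Several of the datasets above are distributed in instruction format and then converted to a chat-message format for fine-tuning. A minimal sketch of that conversion, assuming the common Alpaca field names (`instruction`, `input`, `output`) and the usual role/content message layout:

```python
import json

def alpaca_to_messages(record):
    """Convert one Alpaca-style record (instruction/input/output)
    into a list of chat messages in role/content format."""
    user_content = record["instruction"]
    if record.get("input"):  # optional context field; appended when present
        user_content += "\n\n" + record["input"]
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["output"]},
    ]

record = {
    "instruction": "Summarize the text.",
    "input": "Datasets are prepared data for training ML models.",
    "output": "Datasets are structured data used to train models.",
}
print(json.dumps(alpaca_to_messages(record), indent=2))
```

The exact message schema varies by training framework, so treat the output layout here as illustrative rather than canonical.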
Types of datasets
During fine-tuning, the validation dataset and the test dataset serve distinct but important roles in ensuring that the model performs well and generalizes effectively:
1. Validation Dataset:
- Purpose: The validation dataset is used to monitor the model’s performance during the fine-tuning process and guide decisions on hyperparameters (like learning rate or number of epochs).
- How it’s used:
- After each epoch (or periodically during training), the fine-tuned model is evaluated on the validation set.
- This helps check if the model is overfitting to the training data or not generalizing well.
- Key metrics (such as accuracy, loss, or F1-score) are measured on this set, and adjustments can be made accordingly.
- Early Stopping: Validation performance is often used to trigger early stopping—stopping the fine-tuning process when the model’s performance on the validation set starts to degrade, which indicates overfitting.
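The early-stopping logic described above can be sketched in plain Python. The per-epoch losses here are simulated; in practice each value would come from evaluating the fine-tuned model on the validation set, and the `patience` threshold is an illustrative hyperparameter:

```python
def early_stopping(val_losses, patience=2):
    """Return the epoch (0-based) at which training should stop:
    the first epoch after the validation loss has failed to improve
    for `patience` consecutive epochs, else the final epoch."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss       # new best validation loss
            bad_epochs = 0
        else:
            bad_epochs += 1   # no improvement this epoch
            if bad_epochs >= patience:
                return epoch  # degrading: likely overfitting
    return len(val_losses) - 1

# Validation loss improves for three epochs, then degrades
losses = [0.90, 0.70, 0.60, 0.65, 0.72, 0.80]
print(early_stopping(losses))  # → 4
```

Training frameworks provide this as a built-in callback; the point of the sketch is only that the decision is driven by validation loss, never by training loss.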
2. Test Dataset:
- Purpose: The test dataset is used to evaluate the model’s performance after fine-tuning is complete. It provides an unbiased assessment of the model’s generalization to unseen data.
- How it’s used:
- The test set is not used during training or for hyperparameter tuning. It serves as a final check of how well the model is expected to perform in real-world scenarios.
- After fine-tuning, the model is evaluated on this test set, and metrics like accuracy, precision, recall, and F1-score are calculated to give a clear picture of its true performance.
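As a concrete illustration of the metrics named above, a small self-contained sketch that computes them for binary labels (libraries such as scikit-learn provide the same calculations in practice):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy held-out test labels vs. model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))
```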
Summary:
- Validation set is used during fine-tuning to monitor and guide the training process, helping adjust parameters and prevent overfitting.
- Test set is used only after fine-tuning is complete to provide an unbiased evaluation of the model’s performance on unseen data.
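The separation summarized above starts with how the data is carved up in the first place. A minimal sketch of a three-way split (the 80/10/10 fractions and fixed seed are illustrative choices, not requirements):

```python
import random

def train_val_test_split(examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle once, then carve the data into three disjoint splits.
    The test split is set aside and not touched until fine-tuning is done."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed for a reproducible split
    n = len(items)
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # → 80 10 10
```

Because the splits are disjoint, any metric computed on the test set reflects data the model never saw during training or hyperparameter tuning.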