Datasets

Machine learning revolves around data. Before constructing an ML model for a specific task, careful consideration of the data is essential. Once the relevant data is identified, it must be prepared and structured so that it is well suited for model training. In ML, the term “dataset” refers to this prepared and structured data, which serves as the foundation for training and evaluating models. Datasets can be private or public.

Enterprises that build ML models use their own proprietary data to create datasets. These datasets are typically closely guarded because they may contain sensitive data; in other words, they are private to the organization that creates them.

Public datasets are openly available. They are mostly built from public data by a variety of entities, such as:

  • Research institutions & universities
  • Non-profit organizations
  • Government agencies
  • Technology companies
  • Individuals
  • Collaborative efforts (Academia, Private sector, Government …)

References

  • Hugging Face Datasets
  • Hugging Face Datasets SQL console
  • Hugging Face Datasets classes
  • High-quality human-annotated chat
  • LLM-as-a-Judge (Hugging Face blog)
  • Platypus: training for logical reasoning
  • Llama-factory example datasets
  • Multi-turn chat dataset

Public Datasets

Owners of public datasets make them available for general or even commercial use via their own portals, Git repositories, and/or public machine learning hubs such as:

  • Kaggle
  • Hugging Face Datasets

Databricks Dolly 15K

https://huggingface.co/datasets/databricks/databricks-dolly-15k?row=49

Databricks invited its employees to help build this dataset, which covers QA, summarization, and other task types.
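
As a minimal sketch (assuming the Hugging Face datasets library is installed and the Hub is reachable), such a public dataset can be loaded directly by its Hub id:

```python
from datasets import load_dataset

# databricks-dolly-15k ships as a single "train" split of ~15k instruction records.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

print(dolly)                      # column names and row count
print(dolly[0]["instruction"])    # first instruction in the dataset
```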

Fine-tuning datasets

SQL coder:

OpenAI sample datasets: Example datasets published by OpenAI.

Marketing emails: Email templates for product marketing.

Patient-doctor conversations

Code Contest Dataset

Alpaca conversational: Alpaca dataset converted to JSON message format (see the conversion sketch after this list).

Google C4 dataset
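
To illustrate the JSON message format mentioned for the Alpaca conversational dataset, the sketch below maps a single Alpaca-style record (instruction/input/output fields) to the chat-style list of role/content messages that many fine-tuning frameworks expect. The helper function and example record are hypothetical.

```python
import json

def alpaca_to_messages(record: dict) -> list[dict]:
    # Combine the instruction with the optional input/context into one user turn.
    user_content = record["instruction"]
    if record.get("input"):
        user_content += "\n\n" + record["input"]
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["output"]},
    ]

example = {
    "instruction": "Summarize the text below.",
    "input": "Machine learning revolves around data ...",
    "output": "ML models are built from carefully prepared datasets.",
}
print(json.dumps(alpaca_to_messages(example), indent=2))
```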

Types of datasets

During fine-tuning, the validation dataset and the test dataset serve distinct but important roles in ensuring that the model performs well and generalizes effectively:

1. Validation Dataset:

  • Purpose: The validation dataset is used to monitor the model’s performance during the fine-tuning process and guide decisions on hyperparameters (like learning rate or number of epochs).
  • How it’s used:
    • After each epoch (or periodically during training), the fine-tuned model is evaluated on the validation set.
    • This helps check if the model is overfitting to the training data or not generalizing well.
    • Key metrics (such as accuracy, loss, or F1-score) are measured on this set, and adjustments can be made accordingly.
  • Early Stopping: Validation performance is often used to trigger early stopping—stopping the fine-tuning process when the model’s performance on the validation set starts to degrade, which indicates overfitting.
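
Hedged sketch: in the Hugging Face Trainer API this pattern is packaged as EarlyStoppingCallback, but the idea is framework-agnostic. The self-contained PyTorch loop below (synthetic data, made-up model and hyperparameters) shows validation-driven early stopping end to end; it is illustrative only, not a real fine-tuning loop.

```python
import torch
from torch import nn

torch.manual_seed(0)
X_train, y_train = torch.randn(800, 16), torch.randn(800, 1)
X_val,   y_val   = torch.randn(200, 16), torch.randn(200, 1)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

best_val_loss, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(100):
    # Training step on the training set.
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

    # Monitor the validation set after each epoch.
    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()

    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # validation loss stopped improving: overfitting signal
            print(f"Early stopping at epoch {epoch}")
            break

model.load_state_dict(best_state)    # keep the best checkpoint, not the last one
```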

2. Test Dataset:

  • Purpose: The test dataset is used to evaluate the model’s performance after fine-tuning is complete. It provides an unbiased assessment of the model’s generalization to unseen data.
  • How it’s used:
    • The test set is not used during training or for hyperparameter tuning. It serves as a final check of how well the model is expected to perform in real-world scenarios.
    • After fine-tuning, the model is evaluated on this test set, and metrics like accuracy, precision, recall, and F1-score are calculated to give a clear picture of its true performance.
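
As a small, self-contained illustration of that final evaluation step (scikit-learn metric functions on placeholder labels and predictions, not tied to any specific model):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]     # ground-truth test labels (placeholder)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]     # model predictions on the test set (placeholder)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```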

Summary:

  • Validation set is used during fine-tuning to monitor and guide the training process, helping adjust parameters and prevent overfitting.
  • Test set is used only after fine-tuning is complete to provide an unbiased evaluation of the model’s performance on unseen data.
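
Putting the splits together: when a dataset ships with only a single training split (as the Dolly 15K example above does), a common approach is to carve out validation and test sets yourself. The sketch below uses the Hugging Face datasets library; the 80/10/10 proportions and the seed are arbitrary choices.

```python
from datasets import load_dataset

# Assumption: the raw dataset has only a "train" split (true for databricks-dolly-15k).
raw = load_dataset("databricks/databricks-dolly-15k", split="train")

# Hold out 20% of the rows, then divide the holdout evenly into validation and test.
first_split = raw.train_test_split(test_size=0.2, seed=42)
holdout     = first_split["test"].train_test_split(test_size=0.5, seed=42)

train_ds      = first_split["train"]   # ~80%: used to update model weights
validation_ds = holdout["train"]       # ~10%: monitored during fine-tuning
test_ds       = holdout["test"]        # ~10%: evaluated once, after fine-tuning
```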