Fine-tuning datasets

While the steps involved in preparing a dataset for domain adaptation and instruction fine-tuning are similar, there are important distinctions between the two processes, especially in terms of data preparation, objectives, and the specific types of tasks each one addresses.

1. Domain Adaptation

  • Goal: Domain adaptation fine-tunes a model to specialize in a specific domain (e.g., legal, medical, financial) so that it better understands the language, style, terminology, and nuances of that domain.
  • Key Dataset Considerations: The dataset is domain-specific, but the structure and format of the data may be the same as during pre-training, focusing on text from that domain without specific instructions.

2. Instruction Fine-Tuning

  • Goal: Instruction fine-tuning teaches the model to better understand and follow explicit instructions provided in prompts, often across multiple tasks (e.g., summarization, answering questions, translation). The goal is to make the model more aligned with user queries in natural, interactive settings.
  • Key Dataset Considerations: The dataset includes explicit instruction-response pairs, where the model learns to follow task-specific instructions.

Now, let’s explore how each step in the fine-tuning process differs between domain adaptation and instruction fine-tuning.


Step 1: Define the Task or Objective

  • Domain Adaptation:

    • The task is to adapt the model to a specific domain. The objective is for the model to become more proficient in generating or understanding text within that particular domain, such as medical or legal text. The focus is on specialized language, terminology, and context within the domain.
    • Example: A legal chatbot that generates legal advice based on contract language.
  • Instruction Fine-Tuning:

    • The goal is to teach the model to understand and follow explicit instructions across multiple tasks. The objective is for the model to learn to perform different tasks based on the user’s command, such as “Summarize this text” or “Translate this to French.” The focus is on aligning the model with task-specific prompts and instructions.
    • Example: A general-purpose assistant that can take various instructions like “Explain this term” or “Provide a summary.”

Step 2: Select and Collect Relevant Data

  • Domain Adaptation:

    • The data should come from the specific domain you are adapting the model to. For example, for a medical domain model, the dataset might consist of medical literature, patient records, clinical reports, or research articles.
    • Sources: Domain-specific articles, legal documents, or proprietary internal data like customer reviews or support tickets.
    • Data Type: Generally consists of unstructured text from the target domain (e.g., contracts, scientific articles).
  • Instruction Fine-Tuning:

    • The dataset needs to include task-specific instructions and corresponding outputs. The data often comes from diverse sources that represent different tasks the model should learn to handle.
    • Sources: Crowdsourced instruction datasets, open-domain datasets, or proprietary data with instructions such as “summarize,” “translate,” “explain,” etc.
    • Data Type: Instruction-response pairs, where each example consists of an instruction (e.g., “Translate this sentence”) and the corresponding response (e.g., “Voici la phrase traduite”).
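As an illustration of the two data types, the sketch below builds a couple of hypothetical examples (the legal-domain sentences and the file name are made up) and writes the instruction-response pairs to JSONL, a common storage format with one JSON object per line:

```python
import json

# Domain adaptation: raw, unstructured text from the target domain.
domain_texts = [
    "This Agreement shall be governed by the laws of the State of Delaware.",
    "The Lessee shall pay rent on the first day of each calendar month.",
]

# Instruction fine-tuning: explicit instruction-response pairs.
instruction_pairs = [
    {"instruction": "Summarize this clause: 'The Lessee shall pay rent "
                    "on the first day of each calendar month.'",
     "response": "The tenant must pay rent monthly, on the first of the month."},
]

# JSONL: one JSON object per line, easy to stream during training.
with open("instruction_pairs.jsonl", "w") as f:
    for pair in instruction_pairs:
        f.write(json.dumps(pair) + "\n")
```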

Step 3: Data Annotation (If Necessary)

  • Domain Adaptation:

    • For domain adaptation, annotation is usually unnecessary when you are working with raw, unlabeled text: the model picks up the domain’s language patterns and terminology simply by continuing to train on that text (unsupervised learning). If the task involves classification or other specific predictions, labeled data may be required (e.g., labeled medical conditions).
    • Supervised or Unsupervised: Can be either, depending on the task, but often unsupervised in nature if simply learning the domain style or terminology.
  • Instruction Fine-Tuning:

    • Annotation is essential because the model must learn to follow specific instructions. You need labeled data with clear input-output pairs, where each input is a task instruction, and the output is the expected result.
    • Supervised Learning: This is primarily a supervised process where each instruction must be clearly labeled with the expected outcome.
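In practice, this supervised setup is often implemented by computing the loss only on the response tokens: prompt tokens are masked out of the labels with a sentinel value (-100 is the convention in frameworks such as PyTorch, where it is `CrossEntropyLoss`’s default `ignore_index`). The sketch below illustrates the idea with made-up token ids:

```python
IGNORE_INDEX = -100  # sentinel that excludes a position from the loss

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt in the labels so
    the loss is computed only on the response tokens."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Toy example with hypothetical token ids.
input_ids, labels = build_labels([101, 7, 42], [55, 66, 102])
# input_ids -> [101, 7, 42, 55, 66, 102]
# labels    -> [-100, -100, -100, 55, 66, 102]
```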

Step 4: Data Cleaning and Preprocessing

  • Domain Adaptation:

    • Preprocessing for domain adaptation focuses on domain-specific cleaning (e.g., removing irrelevant text, abbreviations specific to the field) and tokenization. The goal is to ensure that the text fits the target domain, whether legal, medical, or financial.
    • Example: Cleaning medical datasets to ensure abbreviations or medical codes are consistent, or removing non-relevant domain-specific text.
  • Instruction Fine-Tuning:

    • Preprocessing focuses on ensuring clear instruction-response pairs. The instructions must be well-structured, unambiguous, and paired with accurate responses. Preprocessing includes ensuring that the format of instructions is consistent across the dataset.
    • Example: Ensuring that the input is always structured as “Instruction: [instruction]” and the output is “Response: [response]”.
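A minimal cleaning pass over instruction-response pairs might look like the following sketch (the helper name and filtering rules are illustrative; real pipelines typically add near-duplicate detection, length filters, and language checks):

```python
def clean_pairs(pairs):
    """Trim whitespace and drop empty or exactly duplicated pairs."""
    seen = set()
    cleaned = []
    for p in pairs:
        instr = p.get("instruction", "").strip()
        resp = p.get("response", "").strip()
        if not instr or not resp:
            continue  # discard incomplete pairs
        key = (instr.lower(), resp.lower())
        if key in seen:
            continue  # discard exact duplicates
        seen.add(key)
        cleaned.append({"instruction": instr, "response": resp})
    return cleaned

raw = [
    {"instruction": " Summarize this paragraph. ", "response": "A short summary."},
    {"instruction": "Summarize this paragraph.", "response": "A short summary."},
    {"instruction": "", "response": "Orphan response."},
]
print(clean_pairs(raw))  # only one pair survives
```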

Step 5: Format the Data for Fine-Tuning

  • Domain Adaptation:

    • Domain adaptation typically involves standard text format, as the task is generally about modeling natural language from that domain. The data is formatted as a continuous stream of text (for language modeling) or, in the case of classification, as input-output pairs where the output is a domain-specific label.
    • Example: A continuous document of legal contracts where the model adapts to legal language and structure.
  • Instruction Fine-Tuning:

    • Here, each data point is a clear instruction-output pair. It’s crucial that the model understands the relationship between the instruction and the response.
    • Example:
      • Instruction: “Summarize this paragraph.”
      • Output: “The paragraph discusses the impact of climate change on sea levels.”
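A small formatting helper along these lines can render each pair into the training text (the template string is an assumption; any scheme works as long as it is applied uniformly across the dataset):

```python
def format_example(pair,
                   template="Instruction: {instruction}\nResponse: {response}"):
    """Render one instruction-response pair into a consistent text format."""
    return template.format(**pair)

pair = {
    "instruction": "Summarize this paragraph.",
    "response": "The paragraph discusses the impact of climate change on sea levels.",
}
print(format_example(pair))
```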

Step 6: Split the Dataset

Both approaches require splitting the dataset into training, validation, and test sets. The split is performed similarly in both cases, though the emphasis differs:

  • Domain Adaptation: Focuses on ensuring the model generalizes well within the domain.
  • Instruction Fine-Tuning: Focuses on ensuring the model generalizes across a wide range of tasks and instructions.
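A simple, reproducible 80/10/10 split can be sketched as follows (the ratios and helper name are illustrative):

```python
import random

def split_dataset(examples, train=0.8, val=0.1, seed=42):
    """Shuffle and split into train/validation/test (remainder -> test)."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # fixed seed for reproducibility
    n = len(examples)
    n_train = int(n * train)
    n_val = int(n * val)
    return (examples[:n_train],
            examples[n_train:n_train + n_val],
            examples[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(100))
# 80 / 10 / 10 examples respectively
```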

Step 7: Data Augmentation (Optional)

  • Domain Adaptation:

    • If domain-specific data is scarce, you might augment the dataset by paraphrasing or adding more domain-specific examples (e.g., creating more financial reports).
  • Instruction Fine-Tuning:

    • You may generate more diverse instruction-response pairs by rephrasing instructions, expanding the dataset with variations of similar tasks to ensure the model learns to handle multiple ways of asking the same question.
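One minimal, template-based way to do this is to rewrite each instruction with a few alternative wordings while keeping the response unchanged (the wordings below are hypothetical; paraphrasing with another model is a common, more scalable alternative):

```python
# Alternative wordings for the same underlying task.
REPHRASINGS = [
    "Summarize this paragraph.",
    "Give a brief summary of the following paragraph.",
    "In one or two sentences, what is this paragraph about?",
]

def augment(pair, rephrasings=REPHRASINGS):
    """Yield one variant of the pair per alternative instruction wording."""
    for wording in rephrasings:
        yield {"instruction": wording, "response": pair["response"]}

pair = {"instruction": "Summarize this paragraph.",
        "response": "The paragraph discusses rising sea levels."}
variants = list(augment(pair))  # 3 variants of the same task
```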

Step 8: Ensure Dataset Size is Appropriate

  • Domain Adaptation:

    • Domain adaptation can often work well with smaller datasets because the model is just adapting to the specific language or content of a particular domain. However, more data helps if the domain is highly specialized.
  • Instruction Fine-Tuning:

    • Instruction fine-tuning often requires a larger, more diverse dataset, as the goal is to generalize across a wide range of instructions and tasks. The more varied the instructions, the better the model performs at following user commands.

Step 9: Validate the Dataset

  • Domain Adaptation:

    • Validation here involves ensuring the dataset represents the full range of language and terminology used within the specific domain. The goal is to make sure the model can adapt effectively to domain-specific knowledge.
  • Instruction Fine-Tuning:

    • For instruction fine-tuning, validation ensures that the model is learning to follow a diverse set of instructions. You want to ensure the model understands not just the instructions it was trained on but can also generalize to unseen tasks.
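One cheap diagnostic for instruction diversity is counting how often each leading verb appears: a dataset dominated by a single verb probably will not teach the model to generalize across task types. A sketch (taking the first word as the "verb" is a rough, assumed heuristic):

```python
from collections import Counter

def instruction_diversity(pairs):
    """Count how often each leading word appears across instructions."""
    return Counter(p["instruction"].split()[0].rstrip(":,").lower()
                   for p in pairs if p["instruction"].strip())

pairs = [
    {"instruction": "Summarize this text.", "response": "..."},
    {"instruction": "Translate this to French.", "response": "..."},
    {"instruction": "Summarize the report.", "response": "..."},
]
print(instruction_diversity(pairs))  # Counter({'summarize': 2, 'translate': 1})
```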

Step 10: Document the Dataset

  • Domain Adaptation:

    • Document the domain-specific considerations (e.g., why certain data was included, any terminology used, or legal/medical conventions).
  • Instruction Fine-Tuning:

    • Document the types of instructions and tasks the dataset covers, as well as any specific formatting decisions. This is essential for evaluating whether the model is capable of handling new, unseen instructions.

Summary of Differences:

  • Domain Adaptation: Focuses on adapting the model to a specific domain by fine-tuning it on relevant text (usually unsupervised or lightly supervised). The model improves its understanding of domain-specific language, terminology, and nuances.

  • Instruction Fine-Tuning: Focuses on making the model better at following instructions across multiple tasks. The dataset includes explicit instruction-response pairs, and the model learns to generalize across different kinds of instructions.

Both processes share common steps like data collection, cleaning, and validation, but the key difference lies in the structure of the data and the end-goal of the fine-tuning process.


Example: Fraud detection task

Below is an example of a fraud instruction dataset that provides clear instructions to the model to identify or explain potential fraud indicators based on common features. This dataset uses instruction-response pairs where each instruction prompts the model to analyze specific transaction data and detect fraud or explain why certain features are suspicious.

Example: Fraud Instruction Dataset for Fine-Tuning

[
  {
    "instruction": "Analyze this transaction for potential fraud: 'Transaction ID: TXN12345, Amount: $15,000, Type: Wire Transfer, Country: Offshore Country, Device Location: Different from customer residence, Transaction Time: 2:30 AM, Previous Transactions in 24 hrs: 8, Merchant Category: High-risk.'",
    "response": "This transaction is likely to be fraudulent because: 1. Large wire transfer to an offshore account, 2. Device location differs from customer's residence, 3. Transaction occurred at an unusual time (2:30 AM), 4. High frequency of transactions within 24 hours (8), 5. High-risk merchant category."
  },
  {
    "instruction": "Classify the following transaction as fraudulent or legitimate: 'Transaction ID: TXN67890, Amount: $50, Type: Online Purchase, Country: Customer Residence, Device Location: Same as customer residence, Transaction Time: 12:45 PM, Previous Transactions in 24 hrs: 1, Merchant Category: Retail.'",
    "response": "Legitimate. The transaction amount is small, occurs at a reasonable time, and the device location matches the customer's residence. The low number of previous transactions in 24 hours and the typical merchant category (Retail) also indicate normal behavior."
  },
  {
    "instruction": "Explain why this transaction might be flagged as fraudulent: 'Transaction ID: TXN98765, Amount: $9,000, Type: Credit Card, Country: Different from customer’s residence, Device Location: Same as customer residence, Transaction Time: 11:30 PM, Merchant Category: Unusual for this customer.'",
    "response": "This transaction may be flagged as fraudulent due to the following reasons: 1. The large amount is unusual for a credit card transaction, 2. The transaction occurred at a late hour (11:30 PM), 3. The purchase is from a merchant category that is uncommon for the customer."
  },
  {
    "instruction": "Identify potential fraud indicators for this transaction: 'Transaction ID: TXN11223, Amount: $7,500, Type: ATM Withdrawal, Country: Same as customer’s residence, Device Location: ATM location is nearby, Transaction Time: 3:00 AM, Previous Transactions in 24 hrs: 12, Merchant Category: ATM withdrawal.'",
    "response": "Potential fraud indicators: 1. The withdrawal amount is high for an ATM transaction, 2. Unusual transaction time (3:00 AM), 3. High frequency of transactions within the last 24 hours (12), which is abnormal for typical ATM behavior."
  },
  {
    "instruction": "Classify the following transaction based on the features provided: 'Transaction ID: TXN45678, Amount: $500, Type: Online Purchase, Country: Foreign Country, Device Location: Same as customer’s device used before, Transaction Time: 1:00 PM, Previous Transactions in 24 hrs: 0, Merchant Category: Electronics.'",
    "response": "Legitimate. Although the transaction occurred in a foreign country, the device location matches the customer's known device, the transaction time is reasonable, and there were no prior transactions within 24 hours, suggesting typical purchasing behavior."
  },
  {
    "instruction": "Assess the likelihood of fraud for this action: 'Transaction ID: TXN99999, Amount: $20,000, Type: Wire Transfer, Country: Domestic, Device Location: Different from customer’s home location, Transaction Time: 4:00 AM, Previous Transactions in 24 hrs: 0, Merchant Category: Financial Services.'",
    "response": "Likely fraudulent. The large wire transfer amount, the unusual time of transaction (4:00 AM), and the fact that the device location does not match the customer’s home location are all red flags. Additionally, the transaction type (wire transfer) in the financial services category could be part of a money-laundering scheme."
  },
  {
    "instruction": "Is this transaction suspicious based on the following features: 'Transaction ID: TXN55443, Amount: $200, Type: Credit Card Purchase, Country: Customer’s residence, Device Location: Same as customer’s previous device usage, Transaction Time: 10:00 AM, Merchant Category: Retail.'",
    "response": "Not suspicious. The transaction amount is small, the time of day is normal (10:00 AM), and both the device and location match the customer’s previous usage. The purchase from a retail store suggests typical behavior."
  },
  {
    "instruction": "Explain the fraud indicators in this transaction: 'Transaction ID: TXN33321, Amount: $8,500, Type: Bank Transfer, Country: Offshore, Device Location: Matches known device, Transaction Time: 12:15 AM, Merchant Category: Unfamiliar for the customer.'",
    "response": "Fraud indicators include: 1. Large amount transferred offshore, 2. Unusual transaction time (12:15 AM), 3. The merchant category is unfamiliar to the customer, which could be indicative of account compromise."
  },
  {
    "instruction": "Analyze this transaction for potential fraud: 'Transaction ID: TXN77112, Amount: $12,000, Type: Wire Transfer, Country: Same as customer’s residence, Device Location: Different from customer’s residence, Transaction Time: 3:00 PM, Previous Transactions in 24 hrs: 15, Merchant Category: Financial Services.'",
    "response": "This transaction may be fraudulent. The amount is high, and the device location does not match the customer's residence. The high number of previous transactions in a 24-hour period (15) is unusual, especially for large amounts in financial services."
  },
  {
    "instruction": "Summarize the key fraud risk factors in the following case: 'Transaction ID: TXN22456, Amount: $25,000, Type: Business Payment, Country: Offshore, Device Location: Unfamiliar device, Transaction Time: 2:00 AM, Merchant Category: High-risk services.'",
    "response": "Key fraud risk factors: 1. Large transaction amount ($25,000), 2. Offshore payment, 3. Transaction made from an unfamiliar device, 4. Unusual transaction time (2:00 AM), 5. High-risk merchant category. These factors combined suggest a high likelihood of fraud."
  }
]
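To consume a dataset like the one above, a minimal loading sketch might read the JSON array and render each pair into a single training string (the file name and the "Instruction:/Response:" template are assumptions, not fixed conventions). For brevity, the sample written below is a truncated stand-in for the full array:

```python
import json

# Truncated stand-in for the full fraud instruction array above.
sample = [
    {"instruction": "Analyze this transaction for potential fraud: '...'",
     "response": "This transaction is likely to be fraudulent because: ..."},
]
with open("fraud_instructions.json", "w") as f:
    json.dump(sample, f)

def load_and_format(path):
    """Load instruction-response pairs and render each as one training string."""
    with open(path) as f:
        pairs = json.load(f)
    return [f"Instruction: {p['instruction']}\nResponse: {p['response']}"
            for p in pairs]

texts = load_and_format("fraud_instructions.json")
print(texts[0])
```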

Key Elements of This Fraud Instruction Dataset:

  1. Instruction-Response Format:

    • Each instruction presents a transaction scenario with several features, such as the amount, type, location, time, and history of transactions.
    • The model is expected to assess fraud likelihood, classify the transaction as fraudulent or legitimate, or explain why certain features indicate fraud.
  2. Common Fraud Features:

    • Transaction Amount: Large, unexpected transactions can be red flags.
    • Transaction Time: Unusual transaction times, like very late at night or early in the morning, often indicate suspicious activity.
    • Device Location: Transactions made from a location different from the customer’s usual location can indicate compromised accounts.
    • Frequency of Transactions: A high number of transactions in a short time frame, especially large ones, is a common sign of fraud.
    • Merchant Category: Purchases from high-risk or unfamiliar merchant categories could signal fraudulent activity.
    • Country: Transactions from offshore or high-risk countries are often flagged as potential fraud.
  3. Variety of Scenarios:

    • The dataset includes both fraudulent and legitimate transactions, which helps the model distinguish between normal and suspicious behaviors.
    • Instructions cover various types of transactions: wire transfers, credit card payments, ATM withdrawals, etc.
  4. Explanation-Based Instructions:

    • Some instructions ask the model to explain why a transaction might be fraudulent, teaching the model to identify key risk factors in fraud detection.

This type of fraud instruction dataset can be used to fine-tune models for fraud detection by exposing them to representative transaction patterns. The instructions focus on teaching the model which features signal potential fraud, improving its decision-making in fraud detection systems.