You work for a bank and are responsible for building an API that will be invoked to detect fraudulent credit card transactions in real time. To prove feasibility, you have decided to build a PoC. The challenge is that, for compliance reasons, you do not have access to real data.
In this project you will create a balanced dataset for fine-tuning an LLM for fraud detection.
Part-1: Generate the dataset in JSON format
Part-2: Convert the data to JSON Lines (JSONL) format and split it
Part-3: Balance the dataset
Fine-tuning (optional): Check out the instructions on the page titled Fine-tune for fraud detection
In this synthetic dataset, transactions are labeled as fraudulent based on a few simplified criteria that might indicate unusual or suspicious activity in real-world scenarios. Here are the criteria used:
This dataset is for experimentation only. Fraud detection models are built very effectively with classical machine learning models; LLMs are, at the moment, an area of exploration for this use case. In a real-world scenario, you would work with data and business experts to identify the parameters.
High Transaction Amount: Large purchases, especially for luxury items like electronics or jewelry, are often flagged in fraud detection systems as they’re more likely to indicate fraudulent activity.
Unusual Merchant Types: Certain types of merchants (e.g., jewelry or electronics stores) are more commonly associated with high-value, high-risk transactions, which may raise a red flag.
Foreign Transactions: Transactions occurring in a location that differs from the user’s normal spending pattern, especially international transactions, are flagged as suspicious.
Unusual Timing: Odd hours for transactions may increase the likelihood of a fraud label, especially if the transaction does not align with typical spending times.
Location: If the transaction location is in a country different from the customer's country, there is a high likelihood of fraud.
In real fraud detection, these simplified indicators would be part of a larger set of signals, including historical user behavior, spending patterns, device and network information, and real-time risk scoring to accurately determine fraud.
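To make these criteria concrete, here is a minimal sketch of how they could be encoded as a rule-based labeler over the record format used below. The $1,000 threshold, the high-risk merchant set, the late-night window, and the two-signal cutoff are illustrative assumptions, not part of the project's code:

from datetime import datetime

# Hypothetical rule-based labeler for the simplified criteria above.
# All thresholds and sets here are illustrative assumptions.
HIGH_RISK_MERCHANTS = {"Jewelry", "Electronics"}

def label_transaction(rec):
    txn_country = rec["location"].split(",")[-1].strip()  # "Paris, France" -> "France"
    hour = datetime.strptime(rec["transaction_time"], "%Y-%m-%d %H:%M:%S").hour
    signals = 0
    if rec["amount"] > 1000:                         # high transaction amount
        signals += 1
    if rec["merchant_type"] in HIGH_RISK_MERCHANTS:  # unusual merchant type
        signals += 1
    if txn_country != rec["customer_country"]:       # foreign transaction
        signals += 1
    if hour < 6:                                     # unusual timing (late night)
        signals += 1
    return "fraud" if signals >= 2 else "not_fraud"

print(label_transaction({
    "amount": 1200.75,
    "merchant_type": "Jewelry",
    "location": "Paris, France",
    "transaction_time": "2024-10-28 09:15:42",
    "customer_country": "USA",
}))  # -> "fraud" (high amount + risky merchant + foreign country)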
I need a synthetic dataset for credit card transactions, with details on whether each transaction is potentially fraudulent. The dataset should include diverse records with fields such as Transaction ID, Amount, Merchant Type, Location (city and country), Transaction Time, Device Type, Customer Country, Customer State, Transaction Label (Fraud or Not Fraud), and a Comment.
For each row:
- Generate a unique **Transaction ID**.
- Choose an **Amount** value, varying from low (e.g., below $100) to high (e.g., above $1,000).
- **Merchant Types** should include categories like Groceries, Electronics, Restaurants, Jewelry, and Online Retail.
- **Location** should be specific, including city and country (e.g., “New York, USA” or “Tokyo, Japan”).
- **Transaction Time** should include both date and time.
- For **Device Type**, pick from options like Mobile, Desktop, and Tablet.
- Include **Customer Country** and **Customer State**.
- Label each transaction as **Fraud** or **Not Fraud** based on the pattern of spending, geography, and amount:
  - Mark a transaction as "Fraud" if it’s a high amount in a location far from the customer's usual country/state.
  - Mark as "Not Fraud" for typical transactions (e.g., within usual geography, lower amounts).
- Include a **Comment** explaining the reason for the label (e.g., “unusually high amount outside customer’s region” or “typical transaction in customer’s area”).
Generate 5 example records in the following JSON format:
{
    "transaction_id": "...",
    "amount": ...,
    "merchant_type": "...",
    "location": "...",
    "transaction_time": "...",
    "device_type": "...",
    "customer_country": "...",
    "customer_state": "...",
    "transaction_label": "not_fraud | fraud",
    "comment": "..."
}
The instructions in this project use ChatGPT, but you may use other LLMs such as Gemini or Claude as well.
[Template project]/Fine-Tuning/data/synthetic-credit-card-fraud/credit-card-fraud-chatgpt.json
Do not close the chat session as you will use it for generating additional examples for balancing the dataset.
You may need to carry out this step multiple times due to the LLM’s maximum output token limit, then stitch the rows together to create the dataset (see the stitching sketch after the example below). At the end you should have a single JSON file containing a single array.
[
    {
        "transaction_id": "TXN00123456",
        "amount": 85.50,
        "merchant_type": "Groceries",
        "location": "Chicago, USA",
        "transaction_time": "2024-10-28 14:32:15",
        "device_type": "Mobile",
        "customer_country": "USA",
        "customer_state": "IL",
        "transaction_label": "not_fraud",
        "comment": "Typical grocery purchase within customer's state"
    },
    {
        "transaction_id": "TXN00123457",
        "amount": 1200.75,
        "merchant_type": "Jewelry",
        "location": "Paris, France",
        "transaction_time": "2024-10-28 09:15:42",
        "device_type": "Desktop",
        "customer_country": "USA",
        "customer_state": "CA",
        "transaction_label": "fraud",
        "comment": "High amount spent internationally, unusual for customer location"
    }
]
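If you generated the data in several passes, a short script can stitch the partial outputs into the single array shown above. A sketch, assuming hypothetical part-file names such as chatgpt-part-1.json, chatgpt-part-2.json, and so on:

import json
from pathlib import Path

# Stitch several partial LLM outputs (each saved as its own JSON array)
# into one file containing a single array. Part-file names are assumptions.
part_files = sorted(Path("./synthetic-credit-card-fraud").glob("chatgpt-part-*.json"))
all_records = []
for part in part_files:
    with open(part) as f:
        all_records.extend(json.load(f))

with open("./synthetic-credit-card-fraud/credit-card-fraud-chatgpt.json", "w") as f:
    json.dump(all_records, f, indent=2)
print("Total records:", len(all_records))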
If you prefer to move to the next step without following the steps above, the JSON data file is already provided in the project repository: copy Fine-Tuning/data/synthetic-credit-card-fraud/credit-card-fraud-chatgpt.json from the repository to your Fine-Tuning/data/synthetic-credit-card-fraud folder.
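If you are working in a notebook, one way to copy it (a sketch; the source path is an assumption about where the template repository is checked out):

import shutil

# Copy the provided dataset from the template repository into your working folder.
shutil.copy(
    "<template-repo>/Fine-Tuning/data/synthetic-credit-card-fraud/credit-card-fraud-chatgpt.json",
    "./synthetic-credit-card-fraud/credit-card-fraud-chatgpt.json",
)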
Start by creating a new notebook under the folder:
[Template project]/Fine-Tuning/data/prepare-synthetic-credit-card-fraud.ipynb
Review the code, copy/paste it into your notebook, and run it. It will generate three files with data in JSONL format.
import json

# This file has the synthetic data in JSON array ([ ]) format
j_file = "./synthetic-credit-card-fraud/credit-card-fraud-chatgpt.json"
with open(j_file) as f:
    dat = json.load(f)

# Convert a subset of the JSON array to JSON Lines and write it to a file
def write_jsonl_file(dat_subset, file_name):
    jsonl = ""
    for rec in dat_subset:
        jsonl = jsonl + json.dumps(rec) + "\n"
    with open(file_name, "w") as f:
        f.write(jsonl)
    print(file_name, "# of lines : ", len(dat_subset))

# Train split
output_file_prefix = "./synthetic-credit-card-fraud/credit-card-fraud-chatgpt-"
file_name = output_file_prefix + "train.jsonl"
write_jsonl_file(dat[0:56], file_name)

# Validation split
file_name = output_file_prefix + "validate.jsonl"
write_jsonl_file(dat[56:70], file_name)

# Test split
file_name = output_file_prefix + "test.jsonl"
write_jsonl_file(dat[70:], file_name)
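Note that the slice boundaries (56 and 70) assume the stitched file holds at least roughly 80 records; adjust them if your dataset size differs. As a quick sanity check, you can read each split back and confirm every line parses as JSON (a sketch reusing json and output_file_prefix from the code above):

# Sanity check: every line in each split should parse as one JSON object
for split in ["train", "validate", "test"]:
    path = output_file_prefix + split + ".jsonl"
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    print(path, ":", len(rows), "records parsed")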
At the end of this step, you will know if additional data needs to be generated for balancing the dataset.
# Get the counts for fraud & not_fraud examples
training_file_name = "./synthetic-credit-card-fraud/credit-card-fraud-chatgpt-train.jsonl"

# Count fraud vs not-fraud examples in the training file
def get_training_dataset_distribution(data_file_name):
    fraud_count = 0
    not_fraud_count = 0
    with open(data_file_name) as f:
        for line in f:
            if json.loads(line)["transaction_label"] == "fraud":
                fraud_count = fraud_count + 1
            else:
                not_fraud_count = not_fraud_count + 1
    # Calculate % of examples labeled as fraud
    fraud_pct = int(fraud_count * 100 / (fraud_count + not_fraud_count))
    print("Fraud labels     : ", fraud_pct, "% ")
    print("Not_Fraud labels : ", (100 - fraud_pct), "% ")
    return fraud_count, not_fraud_count

# Check the balance
fraud_count, not_fraud_count = get_training_dataset_distribution(training_file_name)
Determine how many additional examples are needed for the under-represented label. Then, in the chat session you kept open, ask the LLM to generate that many more records of that label in the same JSON format, and save them as a JSON array to credit-card-fraud-chatgpt-train-additional.json (the file read in the cell after next).
# Check number of additional examples to be generated
if (fraud_count - not_fraud_count) > 0:
    print("Augmentation suggested. Add examples for 'Not Fraud':", (fraud_count - not_fraud_count))
elif (fraud_count - not_fraud_count) < 0:
    print("Augmentation suggested. Add examples for 'Fraud':", (not_fraud_count - fraud_count))
else:
    print("Dataset is balanced")
# JSON file with additional examples
j_file_additional = "./synthetic-credit-card-fraud/credit-card-fraud-chatgpt-train-additional.json"

# Open the file and read the JSON array data
with open(j_file_additional) as f:
    additional_dat = json.load(f)

# Print count of additional examples for validation
print("# of additional examples : ", len(additional_dat))

# Convert the JSON array to JSON Lines
jsonl = ""
for rec in additional_dat:
    jsonl = jsonl + json.dumps(rec) + "\n"
# Open credit-card-fraud-chatgpt-train.jsonl and append the augmentation examples to it
training_file_name = "./synthetic-credit-card-fraud/credit-card-fraud-chatgpt-train.jsonl"
with open(training_file_name) as training_file:
    original_training_dat = training_file.read()

output_train_file = "./synthetic-credit-card-fraud/credit-card-fraud-chatgpt-train-augmented.jsonl"
with open(output_train_file, "w") as f:
    f.write(original_training_dat)
    f.write(jsonl)
# Check if the dataset is now balanced
get_training_dataset_distribution(output_train_file)
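One optional refinement, an assumption about your fine-tuning setup rather than a project step: the augmented file has all newly generated examples appended at the end, so if your fine-tuning pipeline does not shuffle training data itself, shuffling the lines avoids presenting all minority-label examples last. A minimal sketch:

import random

# Shuffle the augmented training file in place so the appended
# minority-label examples are not clustered at the end.
output_train_file = "./synthetic-credit-card-fraud/credit-card-fraud-chatgpt-train-augmented.jsonl"
with open(output_train_file) as f:
    lines = f.readlines()

random.seed(42)  # arbitrary seed for a reproducible shuffle
random.shuffle(lines)

with open(output_train_file, "w") as f:
    f.writelines(lines)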