Exercise #2: Create & publish a dataset

Objective

  • Create a dataset with 'train' & 'validation' splits for model pre-training experimentation

  • The total number of rows in the dataset will be a minuscule fraction of the wikimedia/wikipedia dataset, whose full size is roughly 72 GB

  • The original dataset has multiple configs, one per language. You will use just the config wikipedia/20220301.en

Steps

  1. Load the dataset wikipedia/20220301.en
  2. Split the original 'train' data so that 0.01% of it lands in a new 'test' split
    • We will use ONLY this 'test' split, i.e., 0.01% of the original data
  3. Pre-process: break the wiki paragraphs into sentences
    • Each row in the original dataset is a large text blob (one or more paragraphs)
    • Sentences from the same row share the common attributes [id, url, title]
    • Do NOT shuffle the dataset
  4. Split the dataset into [train, validation] with [90%, 10%] split
  5. Upload the dataset to the HF Hub, e.g., I pushed it to acloudfan/wikismall (an end-to-end sketch of these steps follows below)
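
The snippet below is a minimal end-to-end sketch of the five steps. It assumes the legacy `wikipedia` loading script with the `20220301.en` config and NLTK's `sent_tokenize` for sentence splitting; the notebook may use different tools, and newer versions of `datasets` may require `trust_remote_code=True` or the parquet-based `wikimedia/wikipedia` repo instead.

```python
from datasets import DatasetDict, load_dataset
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # newer NLTK releases may also need "punkt_tab"

# 1. Load the English Wikipedia config (the full dataset is roughly 72 GB).
wiki = load_dataset("wikipedia", "20220301.en", split="train")

# 2. Carve out 0.01% of the rows; keep ONLY the 'test' part of this split.
small = wiki.train_test_split(test_size=0.0001, shuffle=False)["test"]

# 3. Break each article (one text blob per row) into sentences.
#    Every sentence inherits the row's id, url, and title.
#    batched=True lets map() return more rows than it received.
def explode_into_sentences(batch):
    out = {"id": [], "url": [], "title": [], "text": []}
    for i, text in enumerate(batch["text"]):
        for sentence in sent_tokenize(text):
            out["id"].append(batch["id"][i])
            out["url"].append(batch["url"][i])
            out["title"].append(batch["title"][i])
            out["text"].append(sentence)
    return out

sentences = small.map(
    explode_into_sentences,
    batched=True,
    remove_columns=small.column_names,
)

# 4. 90% / 10% train / validation split, without shuffling.
splits = sentences.train_test_split(test_size=0.1, shuffle=False)
final = DatasetDict({"train": splits["train"], "validation": splits["test"]})

# 5. Push to the Hub (requires a prior login with a write token).
#    Replace the repo id with your own namespace.
final.push_to_hub("acloudfan/wikismall")
```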

Solution

Dataset on HF Hub

Check out the final dataset at acloudfan/wikismall
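
If you only want to inspect the result, the published dataset can be loaded directly from the Hub. This uses the repo id shown above; substitute your own if you pushed elsewhere.

```python
from datasets import load_dataset

wikismall = load_dataset("acloudfan/wikismall")
print(wikismall)  # expect a DatasetDict with 'train' and 'validation' splits
```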

Notebook

Datasets/wikismall/create-wikismall-dataset.ipynb

Google Colab

  • Make sure to follow the instructions for setting up the required packages (a possible first setup cell is sketched below)
Open In Colab
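
The exact setup cell lives in the notebook; as a rough sketch, a typical first Colab cell installs the libraries used above and logs in to the Hub so that `push_to_hub` can write to your account. The package list here is an assumption.

```python
# Possible first Colab cell (package list is an assumption; defer to the notebook):
# %pip install -U datasets nltk huggingface_hub

from huggingface_hub import notebook_login

notebook_login()  # paste a Hugging Face token with write access before pushing
```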