How to add datasets to huggingface

This mostly chronicles how I added xksteven/dialogue_nli to huggingface. I never found the docs intuitive: huggingface is essentially a git hosting platform, but because its UI is different from other git hosts I had trouble even finding the clone link to get started.

Add an ssh key to huggingface

You will need to add an ssh key to huggingface to be able to upload files via git.

ssh-keygen -t ed25519 -C "your.email@example.com"
ssh-add ~/.ssh/id_ed25519

Then add your public key, i.e. the contents of ~/.ssh/id_ed25519.pub, to the SSH keys section of your huggingface user settings.

Make a new dataset in huggingface

Make a huggingface profile (if you don’t already have one). Click on your profile in the top right and click New Dataset, then give it a name and select the license.

Go to the dataset card and you should be able to clone the repo, then switch its remote to SSH:

git clone https://huggingface.co/datasets/xksteven/dialogue_nli
cd dialogue_nli
# general form: git remote set-url origin git@hf.co:<repo_path>
git remote set-url origin git@hf.co:datasets/xksteven/dialogue_nli

repo_path is of the form

  • <username>/<repo_name> for models
  • datasets/<username>/<repo_name> for datasets
  • spaces/<username>/<repo_name> for Spaces
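To make the pattern concrete, here is a minimal sketch that builds the SSH remote from its parts. The helper name build_ssh_remote and the prefix table are my own illustration, and the username/repo_name layout is my reading of the forms listed above, so treat it as an assumption rather than an official API.

```python
# Hypothetical helper: prefixes follow the repo_path forms described above.
REPO_TYPE_PREFIXES = {
    "model": "",            # models have no prefix
    "dataset": "datasets/",
    "space": "spaces/",
}


def build_ssh_remote(repo_type: str, username: str, repo_name: str) -> str:
    """Return the git@hf.co SSH remote for a repository."""
    if repo_type not in REPO_TYPE_PREFIXES:
        raise ValueError(f"unknown repo type: {repo_type!r}")
    prefix = REPO_TYPE_PREFIXES[repo_type]
    return f"git@hf.co:{prefix}{username}/{repo_name}"


print(build_ssh_remote("dataset", "xksteven", "dialogue_nli"))
# git@hf.co:datasets/xksteven/dialogue_nli
```

The printed value is the same remote used in the set-url command earlier.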

For adding large files (anything bigger than 5MB) you’ll need to install git-lfs. Then add them like so:

git lfs track "dnli/dialogue_nli/dialogue_nli_train.jsonl"
# or track by pattern (quoted so the shell does not expand it)
# git lfs track "*.ogg"
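If you are not sure which files need tracking, a small script can list them. This is only a sketch: the function name find_large_files is made up for illustration, and the 5 MB threshold is simply the figure quoted above, so adjust it if the limit you hit is different.

```python
import os

# Threshold taken from the 5MB figure in the text; this is an assumption.
LFS_THRESHOLD_BYTES = 5 * 1024 * 1024


def find_large_files(root, threshold=LFS_THRESHOLD_BYTES):
    """Return sorted paths (relative to root) of files larger than threshold bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git's own object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            # isfile() skips broken symlinks, which would make getsize() fail.
            if os.path.isfile(path) and os.path.getsize(path) > threshold:
                large.append(os.path.relpath(path, root))
    return sorted(large)
```

Calling find_large_files(".") from the root of the cloned repo returns the candidates, and each one can then be passed to git lfs track.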

Testing

Try loading your dataset.

from datasets import load_dataset

# Load from huggingface using the repo ID
dataset = load_dataset('xksteven/dialogue_nli', split='train')
print(dataset)
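If that load errors out, one thing worth ruling out first is a malformed data file. The sketch below assumes the data is JSON Lines (one JSON object per line, as the .jsonl extension suggests) and that inconsistent fields between records are a possible cause of trouble; check_jsonl is a hypothetical helper, not part of the datasets library.

```python
import json


def check_jsonl(path):
    """Return (record_count, problems) for a JSON Lines file."""
    problems = []
    expected_keys = None
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append(f"line {line_number}: invalid JSON ({error.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_number}: expected a JSON object")
                continue
            count += 1
            keys = set(record)
            if expected_keys is None:
                expected_keys = keys  # first record defines the expected fields
            elif keys != expected_keys:
                problems.append(f"line {line_number}: fields differ from the first record")
    return count, problems
```

An empty problems list does not guarantee the dataset will load, but a non-empty one points at the lines to fix before trying again.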

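If the dataset fails to be processed automatically, you will probably need to add a dataset loading script, i.e. a Python file in the repo that tells the datasets library how to read your files.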

Here’s a template

Test it out locally with:

from datasets import load_dataset

# Path to your local dataset directory
dataset = load_dataset('/path/to/local/dir', split='train')
print(dataset)

Note that all of the names must match, i.e. the dataset name on huggingface is dialogue_nli, so the Python loader file must be called dialogue_nli.py.
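A quick way to catch a naming mismatch before loading is sketched below. Both function names are hypothetical, and the second check assumes that, for a local directory, the script is expected to be named after the directory itself.

```python
from pathlib import Path


def expected_script_name(repo_id: str) -> str:
    """Return the loader filename for a repo ID such as 'user/name'."""
    dataset_name = repo_id.rstrip("/").split("/")[-1]
    return f"{dataset_name}.py"


def has_matching_script(local_dir: str) -> bool:
    """Check that local_dir contains a script named after the directory."""
    directory = Path(local_dir).resolve()
    return (directory / f"{directory.name}.py").is_file()


print(expected_script_name("xksteven/dialogue_nli"))
# dialogue_nli.py
```

Running has_matching_script on the cloned directory before calling load_dataset gives a fast yes/no on whether the names line up.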