Set of Data classes

Ivory uses four classes for data presentation: Data, Dataset, Datasets, and DataLoaders. In this tutorial, we use the following Python module to explain them.

File 5 rectangle/data.py

from dataclasses import dataclass

import numpy as np

import ivory.core.data
from ivory.utils.fold import kfold_split


def create_data(num_samples=1000):
    xy = 4 * np.random.rand(num_samples, 2) + 1
    xy = xy.astype(np.float32)
    dx = 0.1 * (np.random.rand(num_samples) - 0.5)
    dy = 0.1 * (np.random.rand(num_samples) - 0.5)
    z = ((xy[:, 0] + dx) * (xy[:, 1] + dy)).astype(np.float32)
    return xy, z


@dataclass(repr=False)
class Data(ivory.core.data.Data):
    n_splits: int = 4

    DATA = create_data(1000)  # Shared by each run.

    def init(self):  # Called from self.__post_init__()
        self.input, self.target = self.DATA
        self.index = np.arange(len(self.input))
        # Extra fold for test data.
        self.fold = kfold_split(self.input, n_splits=self.n_splits + 1)

        # Creating dummy test data just for demonstration.
        is_test = self.fold == self.n_splits  # Use an extra fold.
        self.fold[is_test] = -1  # -1 for test data.
        self.target = self.target.copy()  # n_splits may be different among runs.
        self.target[is_test] = np.nan  # Delete target for test data.

        self.target = self.target.reshape(-1, 1)  # (sample, class)


def transform(mode, input, target):
    return input, target.reshape(-1)
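
The fold bookkeeping in `Data.init()` can be sketched with NumPy alone. The helper `simple_fold_split` below is a hypothetical stand-in for `ivory.utils.fold.kfold_split`, whose real implementation may differ (for example in how it shuffles or balances folds). The sketch assumes only that the function returns one fold number per sample.

```python
import numpy as np


def simple_fold_split(num_samples, n_splits, seed=0):
    """Hypothetical stand-in for kfold_split: one fold number per sample."""
    rng = np.random.default_rng(seed)
    fold = np.arange(num_samples) % n_splits  # Equal-sized folds 0..n_splits-1.
    rng.shuffle(fold)  # Assign the fold numbers to samples at random.
    return fold


n_splits = 4
target = np.ones(1000, dtype=np.float32)

# One extra fold is requested so that it can be relabeled as test data.
fold = simple_fold_split(len(target), n_splits + 1)
is_test = fold == n_splits
fold[is_test] = -1  # -1 marks test samples.
target[is_test] = np.nan  # Test targets are unknown.

train_size = int((fold != -1).sum())
test_size = int((fold == -1).sum())
print(train_size, test_size)  # 800 200
```

With 1000 samples and `n_splits=4`, the extra fold takes one fifth of the data, which is where the `train_size=800, test_size=200` split comes from.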

Data Class

First import the module and check the basic behavior.

import rectangle.data

data = rectangle.data.Data()
data

[2] 2020-06-20 15:23:40 (5.00ms) python3 (7.21s)

Data(train_size=800, test_size=200)

Data.init() must define four attributes:

  • index: Index of samples.
  • input: Input data.
  • target: Target data.
  • fold: Fold number of each sample (-1 for test data).

Data.get() returns a tuple of (index, input, target). This method is called by a Dataset instance whenever the dataset is indexed.

data.get(0)  # Integer index.

[3] 2020-06-20 15:23:40 (5.00ms) python3 (7.21s)

(0,
 array([2.0772183, 4.9461417], dtype=float32),
 array([10.418284], dtype=float32))

data.get([0, 10, 20])  # Array-like index. list or np.ndarray

[4] 2020-06-20 15:23:40 (4.00ms) python3 (7.22s)

(array([ 0, 10, 20]),
 array([[2.0772183, 4.9461417],
        [3.9508767, 1.6149015],
        [1.3982536, 2.687021 ]], dtype=float32),
 array([[10.418284 ],
        [ 6.441745 ],
        [ 3.8166006]], dtype=float32))
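
The behavior of `get()` for both kinds of index follows naturally from NumPy indexing. The class `MiniData` below is a hypothetical, simplified stand-in written for illustration; it is not Ivory's implementation, which may do more.

```python
import numpy as np


class MiniData:
    """Hypothetical, simplified stand-in for a Data instance."""

    def __init__(self):
        self.input = np.arange(12, dtype=np.float32).reshape(6, 2)
        self.target = self.input.prod(axis=1).reshape(-1, 1)
        self.index = np.arange(len(self.input))

    def get(self, index):
        # NumPy indexing handles an int, a list, or an ndarray the same way.
        return self.index[index], self.input[index], self.target[index]


mini = MiniData()
i, x, y = mini.get(0)  # Scalar index: one sample.
print(i, x.shape, y.shape)
ii, xx, yy = mini.get([0, 2, 4])  # Array-like index: a batch of samples.
print(ii, xx.shape, yy.shape)
```

A scalar index drops the sample axis, while an array-like index keeps it, matching the shapes in the outputs above.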

Dataset Class

An instance of the Dataset class holds one of the train, validation, or test datasets. Here we use Ivory's default Dataset instead of defining a subclass. The Dataset() initializer requires three arguments: a Data instance, a mode, and a fold.

import ivory.core.data

dataset = ivory.core.data.Dataset(data, 'train', 0)
dataset

[5] 2020-06-20 15:23:40 (4.00ms) python3 (7.22s)

Dataset(mode='train', num_samples=600)

ivory.core.data.Dataset(data, 'val', 1)  # Another mode is `test`.

[6] 2020-06-20 15:23:40 (3.00ms) python3 (7.23s)

Dataset(mode='val', num_samples=200)

Like Data, Dataset has an init() method that takes no arguments and returns nothing. You can put any code there to modify the data.

To get data from a Dataset instance, use normal indexing:

dataset[0]  # Integer index.

[7] 2020-06-20 15:23:40 (4.00ms) python3 (7.23s)

(0,
 array([2.0772183, 4.9461417], dtype=float32),
 array([10.418284], dtype=float32))

dataset[[0, 10, 20]]  # Array-like index. list or np.ndarray

[8] 2020-06-20 15:23:40 (4.00ms) python3 (7.23s)

(array([ 0, 16, 33]),
 array([[2.0772183, 4.9461417],
        [4.3623295, 3.6915543],
        [1.6263498, 4.103133 ]], dtype=float32),
 array([[10.418284],
        [15.843957],
        [ 6.581139]], dtype=float32))

index, *_ = dataset[:]  # Get all data.
print(len(index))
index[:10]

[9] 2020-06-20 15:23:40 (5.00ms) python3 (7.24s)

600
array([ 0,  2,  3,  4,  6,  7, 10, 12, 13, 15])

These samples are the subset of the Data instance selected by the mode and fold.
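
The sample counts above are consistent with the following selection rule. It is an assumption inferred from the outputs, not a copy of Ivory's code: `train` takes every non-test sample outside the given fold, `val` takes the given fold, and `test` takes the samples marked with -1. The function name `select_index` is hypothetical.

```python
import numpy as np


def select_index(fold, mode, fold_number):
    """Return sample positions for a mode (assumed rule, not Ivory's code)."""
    if mode == "train":
        mask = (fold != fold_number) & (fold != -1)
    elif mode == "val":
        mask = fold == fold_number
    elif mode == "test":
        mask = fold == -1
    else:
        raise ValueError(f"unknown mode: {mode}")
    return np.flatnonzero(mask)


# 1000 samples: folds 0-3 for cross-validation, -1 for test data.
fold = np.arange(1000) % 5
fold[fold == 4] = -1

sizes = {mode: len(select_index(fold, mode, 0)) for mode in ("train", "val", "test")}
print(sizes)  # {'train': 600, 'val': 200, 'test': 200}
```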

Dataset takes an optional callable argument, transform, which receives the mode, input, and target and returns the (possibly modified) input and target.

def transform(mode: str, input, target):
    if mode == 'train':
        input = input * 2
        target = target * 2
    return input, target

dataset_transformed = ivory.core.data.Dataset(data, 'train', 0, transform)
dataset_transformed[0]

[10] 2020-06-20 15:23:40 (5.00ms) python3 (7.24s)

(0,
 array([4.1544366, 9.892283 ], dtype=float32),
 array([20.836569], dtype=float32))

2 * dataset[0][1], 2 * dataset[0][2]

[11] 2020-06-20 15:23:40 (4.00ms) python3 (7.25s)

(array([4.1544366, 9.892283 ], dtype=float32),
 array([20.836569], dtype=float32))
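
To make the role of the `mode` argument concrete, here is a minimal sketch of how a dataset could apply a transform when indexed. `MiniDataset` and `double_in_train` are hypothetical names used only for illustration; Ivory's real Dataset may apply the transform differently.

```python
import numpy as np


class MiniDataset:
    """Hypothetical stand-in for Dataset; Ivory's real class may differ."""

    def __init__(self, input, target, mode, transform=None):
        self.input, self.target = input, target
        self.mode = mode
        self.transform = transform

    def __getitem__(self, index):
        input, target = self.input[index], self.target[index]
        if self.transform is not None:
            # The mode is passed so that the transform can branch on it.
            input, target = self.transform(self.mode, input, target)
        return index, input, target


def double_in_train(mode, input, target):
    if mode == "train":
        return input * 2, target * 2
    return input, target


x = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
y = np.array([[2.0], [12.0]], dtype=np.float32)

train = MiniDataset(x, y, "train", double_in_train)
val = MiniDataset(x, y, "val", double_in_train)
print(train[0][1], val[0][1])
```

The same transform doubles the values in `train` mode and leaves them unchanged in `val` mode, which is the usual pattern for train-only augmentation.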

Usually, we don't instantiate Dataset directly. Instead, the Datasets class creates the Dataset instances.

Datasets Class

An instance of the Datasets class holds a set of train, validation, and test datasets. Here we use Ivory's default Datasets instead of defining a subclass. The Datasets() initializer requires three arguments: a Data instance, a Dataset factory, and a fold.

from ivory.core.data import Dataset

datasets = ivory.core.data.Datasets(data, Dataset, 0)
datasets

[12] 2020-06-20 15:23:40 (4.00ms) python3 (7.25s)

Datasets(data=Data(train_size=800, test_size=200), dataset=<class 'ivory.core.data.Dataset'>, fold=0)

Note

The second argument (dataset) is not a Dataset instance but a factory that returns one. It may be the Dataset class itself or any other callable that returns a Dataset instance.
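
Because the second argument only needs to be callable, one way to pass extra options such as `transform` is to bind them in advance. The sketch below assumes the factory is called as `factory(data, mode, fold)`, mirroring the three arguments of the Dataset() initializer described earlier. `FakeDataset`, `build_datasets`, and `identity` are hypothetical stand-ins, not part of Ivory's API.

```python
from functools import partial


class FakeDataset:
    """Hypothetical stand-in that records how it was constructed."""

    def __init__(self, data, mode, fold, transform=None):
        self.data, self.mode, self.fold = data, mode, fold
        self.transform = transform


def build_datasets(data, factory, fold):
    # Assumed calling convention: factory(data, mode, fold) for each mode.
    return {mode: factory(data, mode, fold) for mode in ("train", "val", "test")}


def identity(mode, input, target):
    return input, target


# A class is a factory by itself; partial() yields a factory with options bound.
plain = build_datasets("data", FakeDataset, 0)
bound = build_datasets("data", partial(FakeDataset, transform=identity), 0)
print(plain["train"].transform, bound["train"].transform is identity)
```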

for mode, dataset in datasets.items():
    print(mode, dataset)

[13] 2020-06-20 15:23:40 (10.0ms) python3 (7.26s)

train Dataset(mode='train', num_samples=600)
val Dataset(mode='val', num_samples=200)
test Dataset(mode='test', num_samples=200)

Each dataset can be accessed by indexing or attributes.

datasets['train'], datasets.val

[14] 2020-06-20 15:23:40 (5.00ms) python3 (7.27s)

(Dataset(mode='train', num_samples=600), Dataset(mode='val', num_samples=200))

Using Datasets, we can easily split all the data stored in a Data instance into train, validation, and test datasets.

DataLoaders Class

The DataLoaders class is used internally by the ivory.torch.trainer.Trainer and ivory.nnabla.trainer.Trainer classes to yield minibatches in the training loop.

from ivory.torch.data import DataLoaders

dataloaders = DataLoaders(datasets, batch_size=4, shuffle=True)
dataloaders

[15] 2020-06-20 15:23:40 (4.00ms) python3 (7.27s)

DataLoaders(['train', 'val', 'test'])

for mode, dataloader in dataloaders.items():
    print(mode, dataloader)

[16] 2020-06-20 15:23:40 (8.00ms) python3 (7.28s)

train <torch.utils.data.dataloader.DataLoader object at 0x00000140518D5408>
val <torch.utils.data.dataloader.DataLoader object at 0x00000140518D59C8>
test <torch.utils.data.dataloader.DataLoader object at 0x00000140518D5448>

next(iter(dataloaders.train))  # Shuffled

[17] 2020-06-20 15:23:40 (13.0ms) python3 (7.29s)

[tensor([110, 857,   0,  21], dtype=torch.int32),
 tensor([[3.0064, 3.9603],
         [1.2233, 2.2332],
         [2.0772, 4.9461],
         [4.4023, 2.1658]]),
 tensor([[11.8943],
         [ 2.7360],
         [10.4183],
         [ 9.3503]])]

next(iter(dataloaders.val))  # Not shuffled, regardless of `shuffle` argument

[18] 2020-06-20 15:23:40 (4.00ms) python3 (7.30s)

[tensor([ 1,  5,  8, 14], dtype=torch.int32),
 tensor([[1.9011, 4.9406],
         [2.8263, 2.8066],
         [2.2066, 3.6569],
         [4.0281, 1.4676]]),
 tensor([[9.5829],
         [8.0773],
         [8.1560],
         [5.7046]])]

next(iter(dataloaders.test))  # Not shuffled, regardless of `shuffle` argument

[19] 2020-06-20 15:23:40 (4.00ms) python3 (7.30s)

[tensor([ 9, 11, 19, 23], dtype=torch.int32),
 tensor([[4.3053, 4.1206],
         [1.8368, 3.9429],
         [3.0406, 3.5035],
         [3.2625, 4.5660]]),
 tensor([[nan],
         [nan],
         [nan],
         [nan]])]
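
The shuffling behavior shown above (only the training loader is shuffled) can be illustrated without PyTorch. The following is a NumPy-only sketch under that assumption; `iter_minibatches` and `make_loaders` are hypothetical helpers and do not reflect how Ivory or `torch.utils.data.DataLoader` is implemented.

```python
import numpy as np


def iter_minibatches(index, batch_size, shuffle, seed=0):
    """Yield index batches; a NumPy-only sketch, not Ivory's or PyTorch's code."""
    index = np.asarray(index)
    if shuffle:
        index = np.random.default_rng(seed).permutation(index)
    for start in range(0, len(index), batch_size):
        yield index[start : start + batch_size]


def make_loaders(indexes, batch_size, shuffle):
    # Only the training loader honors `shuffle`; evaluation order stays fixed.
    return {
        mode: list(iter_minibatches(index, batch_size, shuffle and mode == "train"))
        for mode, index in indexes.items()
    }


indexes = {"train": [0, 2, 3, 4, 6, 7, 10, 12], "val": [1, 5, 8, 14], "test": [9, 11]}
loaders = make_loaders(indexes, batch_size=4, shuffle=True)
print(loaders["val"][0])
print(len(loaders["train"]), len(loaders["test"][0]))
```

Keeping validation and test batches in a fixed order makes it straightforward to match predictions back to sample indexes.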