TensorFlow Dataset API Exploration Part 1

I’ve been working on a small TensorFlow (TF) project recently, and I noticed how increasingly annoyed I get every time I need to write data pipeline code. Dealing with TF queues, external transformation libraries and whatnot felt clunky and too low level for my liking, at least as far as the API abstractions are concerned.

Keeping an eye on TF development, I’ve been aware of the tf.contrib.data.Dataset API, which has seen a steady stream of improvements over the past year. I played around with it on and off, but never really got any of my code ported to it. Now that the Dataset API has been promoted to the stable TF APIs, I finally decided to take that step. This post is Part 1 of a two-part series which explores the Dataset API and highlights some things that either piqued my curiosity or caught me by surprise when I experimented with it.

This post is by no means a comprehensive tutorial. I decided to write it for my own reference so I have it documented all in one place in case I need it. If you find this post useful or notice anything which is not correct, please don’t hesitate to leave a comment below.

Dataset API quick rundown

According to the official TF programmer’s guide, the Dataset API introduces two abstractions: tf.data.Dataset and tf.data.Iterator. The former represents a sequence of raw data elements (such as images) and/or their transformations; the latter provides a way to extract them, depending on your needs. In layman’s terms: Dataset provides the data source, Iterator provides a way to access it.
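To make the two roles concrete, here is a minimal sketch. It assumes TF 2.x with eager execution, where a Dataset can be iterated directly from Python; in the TF 1.x API this post was written against, the same access would go through an explicit iterator such as dataset.make_one_shot_iterator().

```python
import tensorflow as tf

# The Dataset is the data source: three scalar elements.
dataset = tf.data.Dataset.from_tensor_slices(tf.range(3))

# Iteration is the access path: each step yields one element.
values = [int(element.numpy()) for element in dataset]
print(values)
```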

Creating Datasets

You can create a new Dataset either from data stored in memory, using tf.data.Dataset.from_tensor_slices() or tf.data.Dataset.from_tensors(), or from files stored on your disk, for example files encoded in the TFRecord format. Let’s ignore the on-disk option and have a look at the from_tensors() and from_tensor_slices() methods. These are the methods I have found myself using most frequently.

from_tensor_slices()

According to the documentation, from_tensor_slices() expects “a nested structure of tensors, each having the same size in the 0th dimension” and returns “a Dataset whose elements are slices of the given tensors”. If you are as dumb as me, you’re probably a bit puzzled as to how to use this method.

The programmer’s guide shows a couple of simple examples to get you started. The function argument in these examples is either a single tensor or a tuple of tensors. Let’s ignore the single tensor case and let’s have a look at the tuple one:
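A minimal sketch of the tuple case follows. The tensor shapes [4] and [4, 10] and the int32 dtype of the second tensor are inferred from the results discussed in this post. The variable names are illustrative, and the code assumes a TF 2.x style API (tf.random.uniform, element_spec); in TF 1.x the same information was exposed as dataset.output_shapes and dataset.output_types.

```python
import tensorflow as tf

# Four labels and four rows of raw data (shapes assumed for illustration).
labels = tf.random.uniform([4])
data = tf.random.uniform([4, 10], maxval=100, dtype=tf.int32)

dataset = tf.data.Dataset.from_tensor_slices((labels, data))

# element_spec describes a single element of the dataset.
shapes = tuple(spec.shape.as_list() for spec in dataset.element_spec)
dtypes = tuple(spec.dtype for spec in dataset.element_spec)
print(shapes)
print(dtypes)
```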

This code creates a dataset whose elements are read as a tuple of tensors with shapes () and (10,).

This can be useful for “sourcing” labels (1st element of the returned tuple) together with their raw data (2nd element). Note how the 0-th dimension becomes irrelevant here: it will be specified by the Iterator, as we will see in Part 2, which deals with it. For now, remember that when using from_tensor_slices() what matters is the shape of a single sample and not the number of samples in the sourced data. Like I said, this should become more obvious once demonstrated with an Iterator.

If you are like me, you might wonder what happens if you pass in the same tensors as a list as opposed to a tuple, i.e. what Dataset gets created by the following code:
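A sketch of the list variant, under the same assumptions as before (shapes [4] and [4, 10], a float32 and an int32 tensor, TF 2.x style API). The exact exception class and message vary between TF versions, so the sketch catches a few likely candidates and only reports what was raised.

```python
import tensorflow as tf

labels = tf.random.uniform([4])
data = tf.random.uniform([4, 10], maxval=100, dtype=tf.int32)

error = None
try:
    # The list is converted to a single tensor, which fails for mismatched inputs.
    tf.data.Dataset.from_tensor_slices([labels, data])
except (TypeError, ValueError, tf.errors.OpError) as exc:
    error = exc

print(type(error).__name__)
```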

If you run the above code you will get an exception complaining about the mismatched tensor dtypes.

This suggests that TF does some kind of type-safe tensor list “merge”. If you change the second tensor’s dtype to tf.float32 and try to create the Dataset again, you get a different exception, this time about the tensors’ shapes.
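The second attempt can be sketched the same way, assuming both tensors are now float32 but still have shapes [4] and [4, 10]. Again, the exception type depends on the TF version, so it is caught broadly.

```python
import tensorflow as tf

labels = tf.random.uniform([4])
data = tf.random.uniform([4, 10])  # float32 this time, but still a different rank

error = None
try:
    # The dtypes now agree, but the shapes do not, so the conversion still fails.
    tf.data.Dataset.from_tensor_slices([labels, data])
except (TypeError, ValueError, tf.errors.OpError) as exc:
    error = exc

print(type(error).__name__)
```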

I had a quick look at the TF code and noticed that TF does some kind of tensor flattening and “merging”, and since the tensors in the list have different ranks it can’t proceed, so it throws the exception. This does not happen if you pass the tensors in as a tuple like we demonstrated in the first example. This also sort of confirms my initial theory of the type-safe “merge”.

Let’s move on and have a look at what happens when we pass in a tuple of tensors of the same shape:
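A minimal sketch, assuming two float32 tensors of shape [4, 10] (illustrative names, TF 2.x style API):

```python
import tensorflow as tf

first = tf.random.uniform([4, 10])
second = tf.random.uniform([4, 10])

dataset = tf.data.Dataset.from_tensor_slices((first, second))

# Each element is a pair of slices taken along the 0-th dimension.
shapes = tuple(spec.shape.as_list() for spec in dataset.element_spec)
print(shapes)
```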

Now that the passed-in tensors have the same shape, TF doesn’t throw any exceptions and creates a Dataset whose elements are read as a tuple of tensors of shape (10,) each, i.e. the shape of the supplied tensors without the 0-th dimension.

We can read the data as a tuple of tensors with the same dimensions. This can be handy if you source data and its transformation (e.g. some kind of variational translation) in the same input stream.

Now, let’s have a look at what happens when we pass in a list of tensors of the same shape:
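A sketch of the list variant with two same-shaped tensors, under the same assumptions (shape [4, 10], float32, TF 2.x with eager execution so the dataset can be counted by iterating over it):

```python
import tensorflow as tf

first = tf.random.uniform([4, 10])
second = tf.random.uniform([4, 10])

dataset = tf.data.Dataset.from_tensor_slices([first, second])

# The list is stacked into one tensor and then sliced along the new 0-th dimension.
element_shape = dataset.element_spec.shape.as_list()
num_elements = sum(1 for _ in dataset)
print(element_shape, num_elements)
```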

What I expected was something similar to the tuple case, i.e. a Dataset which would allow reading data as a tuple of tensors. What TF does instead is create a Dataset whose elements are single tensors of the exact same shape as the passed-in tensors.

This is different from the tuple case: notice how in this case you read data tensors with a predefined shape in both dimensions, whilst in the tuple case the number of read tensors is driven by the Iterator. For now, just take a leap of faith with me and don’t think about it too hard. It will be demonstrated with practical examples in the second part of this series.

Saving the best for last: you can actually name the components of your datasets. The code below is from the official programmer’s guide, but I’m adding it here for reference:
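A sketch along the lines of the guide's example. The keys "a" and "b" and the shapes are illustrative choices here rather than a verbatim copy, and the code assumes a TF 2.x style API with eager execution.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices({
    "a": tf.random.uniform([4]),
    "b": tf.random.uniform([4, 10], maxval=100, dtype=tf.int32),
})

# The element structure mirrors the dictionary that was passed in.
shapes = {name: spec.shape.as_list() for name, spec in dataset.element_spec.items()}
print(shapes)

# Each element read from the dataset is a dictionary with the same keys.
first_element = next(iter(dataset))
keys = sorted(first_element.keys())
print(keys)
```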

You will get a Dataset whose elements are dictionaries keyed by the names you supplied.

You can now read your data as dictionaries - super handy for code readability!

from_tensors()

In comparison to from_tensor_slices(), the from_tensors() method does not expect its arguments to have the same 0-th dimension. Let’s walk through the same set of examples as earlier.

We’ll start by creating a Dataset from a tuple of tensors of different dimensions:
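A minimal sketch, reusing the assumed tensors from the from_tensor_slices() examples (shapes [4] and [4, 10], TF 2.x style API):

```python
import tensorflow as tf

labels = tf.random.uniform([4])
data = tf.random.uniform([4, 10], maxval=100, dtype=tf.int32)

dataset = tf.data.Dataset.from_tensors((labels, data))

# from_tensors() does not slice: the single element keeps the full shapes.
shapes = tuple(spec.shape.as_list() for spec in dataset.element_spec)
print(shapes)
```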

We get a dataset which lets us read the data as a tuple; however, the tensors in the returned tuple keep their full shapes, unlike in the from_tensor_slices() case.

Notice how the dimensions differ from the dataset returned by from_tensor_slices():

• from_tensors(): (4,), (4, 10)
• from_tensor_slices(): (), (10,)

I haven’t really found a use for this, but maybe there is one.

Just like with from_tensor_slices(), if we try to create a Dataset using from_tensors() by passing in tensors of different shapes as a list, we will get the same error:
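A sketch of that failing call, assuming two float32 tensors of shapes [4] and [4, 10]. As before, the concrete exception class depends on the TF version, so several candidates are caught.

```python
import tensorflow as tf

labels = tf.random.uniform([4])
data = tf.random.uniform([4, 10])

error = None
try:
    # Tensors of different shapes cannot be merged into a single tensor.
    tf.data.Dataset.from_tensors([labels, data])
except (TypeError, ValueError, tf.errors.OpError) as exc:
    error = exc

print(type(error).__name__)
```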

Now, let’s pass in a tuple of tensors of the same shape. What we get back is the following Dataset:
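A minimal sketch, assuming two float32 tensors of shape [4, 10] (TF 2.x style API):

```python
import tensorflow as tf

first = tf.random.uniform([4, 10])
second = tf.random.uniform([4, 10])

dataset = tf.data.Dataset.from_tensors((first, second))

# Both components of the single element keep their full 2D shape.
shapes = tuple(spec.shape.as_list() for spec in dataset.element_spec)
print(shapes)
```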

TF returns a Dataset which allows reading data as tuples of 2D tensors of the exact same shape (in both dimensions) as the tensors passed in as the function argument.

Once again, notice how this differs from the from_tensor_slices() case:

• from_tensors(): (4, 10), (4, 10)
• from_tensor_slices(): (10,), (10,)

What’s really interesting, though, is how from_tensors() handles the case when you pass in a list of tensors of the same shape:
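A sketch of the list case, assuming two float32 tensors of shape [4, 10] and TF 2.x with eager execution (so the elements can be counted by iterating):

```python
import tensorflow as tf

first = tf.random.uniform([4, 10])
second = tf.random.uniform([4, 10])

dataset = tf.data.Dataset.from_tensors([first, second])

# The list becomes one tensor with a new leading dimension of size 2.
element_shape = dataset.element_spec.shape.as_list()
num_elements = sum(1 for _ in dataset)
print(element_shape, num_elements)
```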

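In this case TF will create a Dataset of 3D tensors. It seems that TF stacks the passed-in tensors along a new leading dimension into one larger 3D tensor, so two tensors of shape (4, 10) end up as a single tensor of shape (2, 4, 10).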

This can be handy if you have separate sources for different image channels and want to combine them into one RGB image tensor.

This sums up Part 1 of the Dataset API exploration. In the next part we will look at some of the functionality of Iterators.

Summary

tf.data.Dataset provides a nice and concise way of creating data pipelines that can be consumed through tf.data.Iterator. We have discussed different methods of reading data from tensors stored in memory. This blog post demonstrated some behavior of the Dataset API that was, to me, unknown or surprising. In the next part we will look at the Iterator API, which should make Part 1 more obvious and understandable. Stay tuned!