I am looking for a way to merge a Dataset with another, but by drawing samples from it only occasionally. For example, given these two Datasets:
import tensorflow as tf

ds1 = tf.data.Dataset.range(1, 10).repeat()         # 1, 2, ..., 9, repeated
ds10 = tf.data.Dataset.range(10, 100, 10).repeat()  # 10, 20, ..., 90, repeated
I would like to add samples from ds10 to those of ds1, but only for every two samples, so that the result would be:
ds = my_merge(ds1, ds10)
list(ds)
# 11, 2, 23, 4, 35, 6, 47...
Is this possible? I would like to avoid solutions that discard samples from ds10, as this would be inefficient in my case.

EDIT: The resulting ds needs to be a Dataset so that further input pipeline operations (e.g. batching) can be applied.
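To illustrate, downstream I want to be able to write something like this (my_merge being the hypothetical function I am asking for):

ds = my_merge(ds1, ds10)
batched = ds.batch(4)  # only works if ds is still a tf.data.Dataset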
CodePudding user response:
Modify the ds10 dataset based on a skip parameter:
import numpy as np
import tensorflow as tf

skip = 2
# Choice pattern [0, 1]: index 0 picks from ds10, index 1 picks from zeros
pattern = np.concatenate(([0], np.ones(skip - 1))).astype(np.int64)
choice_dataset = tf.data.Dataset.from_tensor_slices(pattern).repeat()
zeros = tf.data.Dataset.range(0, 1).repeat()  # an endless stream of zeros
ds10 = tf.data.Dataset.choose_from_datasets([ds10, zeros], choice_dataset)
# [10, 0, 20, 0, 30, 0, 40, 0, 50]
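The pattern construction generalizes to other skip values; as a sketch (pattern3 is just an illustrative name), skip = 3 would produce one real ds10 sample followed by two zeros:

pattern3 = np.concatenate(([0], np.ones(3 - 1))).astype(np.int64)
print(pattern3)  # [0 1 1]
# choose_from_datasets with this pattern yields 10, 0, 0, 20, 0, 0, ...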
Zip both datasets and add their values:
ds = tf.data.Dataset.zip((ds1, ds10))
ds = ds.map(lambda x, y: x + y)  # element-wise sum of each zipped pair
# [11, 2, 23, 4, 35, 6, 47, 8, 59]
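Since ds is a regular Dataset, the pipeline operations mentioned in the question's EDIT apply directly; a quick sketch, assuming the code above:

for batch in ds.batch(3).take(2):
    print(batch)
# tf.Tensor([11  2 23], shape=(3,), dtype=int64)
# tf.Tensor([ 4 35  6], shape=(3,), dtype=int64)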
CodePudding user response:
You can create your own generator:
import tensorflow as tf
from functools import partial

ds1_unrepeated = tf.data.Dataset.range(1, 10)
ds1_spec = ds1_unrepeated.element_spec  # reused as the generator's output signature
ds1 = ds1_unrepeated.repeat()
ds10 = tf.data.Dataset.range(10, 100, 10).repeat()

def my_merge(iter1, iter2):
    sliced_iter1 = iter(iter1)
    sliced_iter2 = iter(iter2)
    while True:
        # one summed sample, then one plain sample from iter1
        yield next(sliced_iter1) + next(sliced_iter2)
        yield next(sliced_iter1)

ds = tf.data.Dataset.from_generator(partial(my_merge, ds1, ds10), output_signature=ds1_spec)
for element in ds:
    print(element)
tf.Tensor(11, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(23, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(35, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(47, shape=(), dtype=int64)
Edit: I have updated it to be a dataset, but I think the answer at the top is more efficient. This answer is only for when the merging should be as lazily evaluated as possible, with little knowledge about the inputs, i.e. when the merging can be arbitrarily complex.
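The generator also generalizes to an arbitrary skip; a minimal sketch (my_merge_every is an illustrative name, assuming the definitions above), yielding one summed element followed by skip - 1 plain ones:

def my_merge_every(iter1, iter2, skip=2):
    it1, it2 = iter(iter1), iter(iter2)
    while True:
        yield next(it1) + next(it2)  # one merged sample
        for _ in range(skip - 1):
            yield next(it1)          # skip - 1 plain samples

ds = tf.data.Dataset.from_generator(
    partial(my_merge_every, ds1, ds10), output_signature=ds1_spec)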