I'm testing this tutorial with a non-IID distribution for federated learning: https://www.tensorflow.org/federated/tutorials/tff_for_federated_learning_research_compression
The posted question "TensorFlow Federated: How to tune non-IIDness in federated dataset?" suggests using tff.simulation.datasets.build_single_label_dataset() as a way to produce a non-IID distribution for the dataset.
I tried to apply that first (see the code below) and got an error.
emnist_train, emnist_test = tff.simulation.datasets.emnist.load_data(
    only_digits=False)

emnist_train1 = tff.simulation.datasets.build_single_label_dataset(
    emnist_train.create_tf_dataset_from_all_clients(),
    label_key='label', desired_label=1)

print(emnist_train1.element_spec)
# OrderedDict([('label', TensorSpec(shape=(), dtype=tf.int32, name=None)), ('pixels', TensorSpec(shape=(28, 28), dtype=tf.float32, name=None))])

print(next(iter(emnist_train1))['label'])
# tf.Tensor(1, shape=(), dtype=int32)
MAX_CLIENT_DATASET_SIZE = 418
CLIENT_EPOCHS_PER_ROUND = 1
CLIENT_BATCH_SIZE = 20
TEST_BATCH_SIZE = 500

def reshape_emnist_element(element):
    return (tf.expand_dims(element['pixels'], axis=-1), element['label'])

def preprocess_train_dataset(dataset):
    return (dataset
            .shuffle(buffer_size=MAX_CLIENT_DATASET_SIZE)
            .repeat(CLIENT_EPOCHS_PER_ROUND)
            .batch(CLIENT_BATCH_SIZE, drop_remainder=False)
            .map(reshape_emnist_element))

emnist_train1 = emnist_train1.preprocess(preprocess_train_dataset)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-17-cda96c33a0f6> in <module>()
     15             .map(reshape_emnist_element))
     16
---> 17 emnist_train1 = emnist_train1.preprocess(preprocess_train_dataset)

AttributeError: 'MapDataset' object has no attribute 'preprocess'
Since the dataset is filtered, it cannot be preprocessed! So, in this case, it is filtered based on which label?
... label_key='label', desired_label=1)
Which label in EMNIST does desired_label=1 correspond to?
My question is:
How can I apply tff.simulation.datasets.build_single_label_dataset() to get a non-IID dataset (a different number of samples for each client) in this specific tutorial, https://www.tensorflow.org/federated/tutorials/tff_for_federated_learning_research_compression, in detail and without the error about the filtered dataset?
I appreciate any help!
Thanks a lot!
Answer:
There may be some confusion between the tff.simulation.datasets.ClientData and tf.data.Dataset APIs that is useful to clear up.
tf.data.Dataset does not have a preprocess method, whereas tff.simulation.datasets.ClientData.preprocess does exist.
However, tff.simulation.datasets.build_single_label_dataset works with tf.data.Dataset instances: both its input argument and its output are tf.data.Dataset instances. In this case, emnist_train1 is a tf.data.Dataset, which does not have a preprocess method.
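As an aside, if the goal were to keep the per-client structure, ClientData.preprocess could be applied before flattening the clients into a single dataset. Here is a minimal sketch of that alternative workflow; the variable names preprocessed_emnist_train and example_dataset are illustrative, and this is not what the question's code is doing:

# Alternative sketch: ClientData.preprocess maps the preprocessing function
# over every client's tf.data.Dataset and returns a new ClientData.
preprocessed_emnist_train = emnist_train.preprocess(preprocess_train_dataset)

# Each per-client dataset is then already shuffled, repeated, batched,
# and reshaped.
example_dataset = preprocessed_emnist_train.create_tf_dataset_for_client(
    preprocessed_emnist_train.client_ids[0])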
However, all is not lost! The preprocess_train_dataset function takes a tf.data.Dataset argument and returns a tf.data.Dataset result. This means that replacing:
emnist_train1 = emnist_train1.preprocess(preprocess_train_dataset)
with
emnist_train1 = preprocess_train_dataset(emnist_train1)
will create a tf.data.Dataset with only a single label ("label non-IID") that is shuffled, repeated, batched, and reshaped. Note that a single tf.data.Dataset is generally used to represent one user in the federated algorithm. To create more clients, each with a random number of batches, something like the following could work:
import random  # for random.randint

# NUM_CLIENTS and MAX_BATCHES are placeholders: choose how many simulated
# clients to create and the maximum number of batches per client.
client_datasets = [
    emnist_train1.take(random.randint(1, MAX_BATCHES))
    for _ in range(NUM_CLIENTS)
]
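As a final usage sketch, these per-client datasets can then be passed as the list of federated data for one round of training. The name federated_averaging below is a stand-in for the iterative process built in the tutorial; the exact constructor and the structure of the returned metrics depend on the TFF version you are running:

# Sketch only: `federated_averaging` stands for the tutorial's iterative
# process; client_datasets is the list built above.
state = federated_averaging.initialize()

for round_num in range(1, 11):
    # Each round trains on the simulated, label-skewed client datasets.
    state, metrics = federated_averaging.next(state, client_datasets)
    print(f'round {round_num}: {metrics}')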