I am building a federated learning model on my own dataset, aiming at a multi-class classification model. The data are split across 8 separate CSV files.
I followed the instructions in this post, as shown in the code below.
import tensorflow as tf
import tensorflow_federated as tff

dataset_paths = {
    'client_0': '/content/ds1.csv',
    'client_1': '/content/ds2.csv',
    'client_2': '/content/ds3.csv',
    'client_3': '/content/ds4.csv',
    'client_4': '/content/ds5.csv',
}

def create_tf_dataset_for_client_fn(id):
  path = dataset_paths.get(id)
  if path is None:
    raise ValueError(f'No dataset for client {id}')
  return tf.data.TextLineDataset(path)

source = tff.simulation.datasets.ClientData.from_clients_and_fn(
    dataset_paths.keys(), create_tf_dataset_for_client_fn)
but it gave me this error:
AttributeError: type object 'ClientData' has no attribute 'from_clients_and_fn'
I was reading this documentation and found other methods under tff.simulation.datasets that looked like they would work, so I replaced .from_clients_and_fn with one of them and the error disappeared, but I don't know whether this is right or what comes next.
My questions are:
- Is this the right method to load the data onto the clients?
- If it is not possible to load the CSV files separately, can I combine all of the data into one CSV file, treat it as non-IID data, and train accordingly? I need some guidance here.
Thanks in advance.
CodePudding user response:
In this setup it may be useful to consider tff.simulation.datasets.FilePerUserClientData together with tf.data.experimental.CsvDataset.
This might look like the following (this makes some test CSV data for the sake of the example; the dataset you're working with likely has other shapes):
dataset_paths = {
    'client_0': '/content/ds1.csv',
    'client_1': '/content/ds2.csv',
    'client_2': '/content/ds3.csv',
    'client_3': '/content/ds4.csv',
    'client_4': '/content/ds5.csv',
}

# Create some test data for the sake of the example;
# normally we wouldn't do this.
for i, (id, path) in enumerate(dataset_paths.items()):
  with open(path, 'w') as f:
    for _ in range(i):
      f.write(f'test,0.0,{i}\n')

# Values that will fill in any CSV cell if it is missing;
# their dtypes (string, float, int) must match the columns
# written above.
record_defaults = ['', 0.0, 0]

@tf.function
def create_tf_dataset_for_client_fn(dataset_path):
  return tf.data.experimental.CsvDataset(
      dataset_path, record_defaults=record_defaults)

source = tff.simulation.datasets.FilePerUserClientData(
    dataset_paths, create_tf_dataset_for_client_fn)
print(source.client_ids)
>>> ['client_0', 'client_1', 'client_2', 'client_3', 'client_4']
for x in source.create_tf_dataset_for_client('client_3'):
print(x)
>>> (<tf.Tensor: shape=(), dtype=string, numpy=b'test'>, <tf.Tensor: shape=(), dtype=float32, numpy=0.0>, <tf.Tensor: shape=(), dtype=int32, numpy=3>)
>>> (<tf.Tensor: shape=(), dtype=string, numpy=b'test'>, <tf.Tensor: shape=(), dtype=float32, numpy=0.0>, <tf.Tensor: shape=(), dtype=int32, numpy=3>)
>>> (<tf.Tensor: shape=(), dtype=string, numpy=b'test'>, <tf.Tensor: shape=(), dtype=float32, numpy=0.0>, <tf.Tensor: shape=(), dtype=int32, numpy=3>)
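As for what comes next: before federated training, each client dataset usually needs to be mapped into model-ready batches. Below is a minimal sketch, not part of the original answer, assuming the three test-data columns above with the last column as the integer label; the preprocess function, batch size, and shuffle buffer are illustrative choices.
import collections
import tensorflow as tf

def preprocess(dataset):
  # Turn each parsed CSV row (a string, a float feature, an int label)
  # into the (features, label) structure a model expects, then batch.
  def to_example(name, value, label):
    del name  # drop the string column in this sketch
    return collections.OrderedDict(
        x=tf.reshape(value, [1]),
        y=tf.reshape(label, [1]))
  return dataset.map(to_example).shuffle(100).batch(20)

# One preprocessed tf.data.Dataset per client, ready to hand to a
# federated training process.
train_data = [
    preprocess(source.create_tf_dataset_for_client(cid))
    for cid in source.client_ids
]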
It may be possible to concatenate all the data into a single CSV, but each record would still need some identifier indicating which row belongs to which client. Mixing all the rows together without any kind of per-client mapping would be akin to standard centralized training, not federated learning.
Once a CSV has all the rows, and perhaps a column with a client_id value, one could presumably use tf.data.Dataset.filter() to yield only the rows belonging to a particular client. This probably won't be particularly efficient though, as it would iterate over the entire global dataset for each client rather than only that client's examples. A rough sketch of this approach follows below.
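To make the filter approach concrete, here is a hedged sketch; the combined file path, column order, and client_id column are assumptions for illustration, not from the original answer.
import tensorflow as tf

COMBINED_CSV = '/content/all_clients.csv'  # hypothetical combined file
# Assumed columns: client_id (string), feature (float), label (int).
record_defaults = ['', 0.0, 0]

def create_dataset_for_client(client_id):
  ds = tf.data.experimental.CsvDataset(
      COMBINED_CSV, record_defaults=record_defaults)
  # Keep only this client's rows; note this scans the entire file
  # every time a client dataset is built.
  return ds.filter(lambda cid, value, label: tf.equal(cid, client_id))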