I am building a federated learning model on my own dataset, aiming at a multi-class classification model. The data are split across 8 separate CSV files.
I followed the instructions in this post, as shown in the code below.
import tensorflow as tf
import tensorflow_federated as tff

dataset_paths = {
    'client_0': '/content/ds1.csv',
    'client_1': '/content/ds2.csv',
    'client_2': '/content/ds3.csv',
    'client_3': '/content/ds4.csv',
    'client_4': '/content/ds5.csv',
}

def create_tf_dataset_for_client_fn(id):
  path = dataset_paths.get(id)
  if path is None:
    raise ValueError(f'No dataset for client {id}')
  return tf.data.TextLineDataset(path)

source = tff.simulation.datasets.ClientData.from_clients_and_fn(
    dataset_paths.keys(), create_tf_dataset_for_client_fn)
but it gave me this error:
AttributeError: type object 'ClientData' has no attribute 'from_clients_and_fn'
I was reading this documentation and found other methods under tff.simulation.datasets that looked like they would work, so I replaced .from_clients_and_fn with one of them and the error disappeared, but I don't know whether this is right or what comes next.
My questions are:
- Is this the right method to load the data onto the clients?
- If it is not possible to load the CSV files separately, can I combine all of the data into one CSV file, treat it as non-IID data, and train accordingly? I need some guidance here.
Thanks in advance.
CodePudding user response:
In this setup it may be useful to consider tff.simulation.datasets.FilePerUserClientData together with tf.data.experimental.CsvDataset.
This might look like the following (this makes some test CSV data for the sake of the example; the dataset you're working with likely has other shapes):
dataset_paths = {
    'client_0': '/content/ds1.csv',
    'client_1': '/content/ds2.csv',
    'client_2': '/content/ds3.csv',
    'client_3': '/content/ds4.csv',
    'client_4': '/content/ds5.csv',
}

# Create some test data for the sake of the example;
# normally we wouldn't do this.
for i, (id, path) in enumerate(dataset_paths.items()):
  with open(path, 'w') as f:
    for _ in range(i):
      f.write(f'test,0.0,{i}\n')

# Values that will fill in any CSV cell if it is missing;
# their dtypes (string, float, int) must match the columns
# written above.
record_defaults = ['', 0.0, 0]

@tf.function
def create_tf_dataset_for_client_fn(dataset_path):
  return tf.data.experimental.CsvDataset(
      dataset_path, record_defaults=record_defaults)

source = tff.simulation.datasets.FilePerUserClientData(
    dataset_paths, create_tf_dataset_for_client_fn)
print(source.client_ids)
>>> ['client_0', 'client_1', 'client_2', 'client_3', 'client_4']
for x in source.create_tf_dataset_for_client('client_3'):
print(x)
>>> (<tf.Tensor: shape=(), dtype=string, numpy=b'test'>, <tf.Tensor: shape=(), dtype=float32, numpy=0.0>, <tf.Tensor: shape=(), dtype=int32, numpy=3>)
>>> (<tf.Tensor: shape=(), dtype=string, numpy=b'test'>, <tf.Tensor: shape=(), dtype=float32, numpy=0.0>, <tf.Tensor: shape=(), dtype=int32, numpy=3>)
>>> (<tf.Tensor: shape=(), dtype=string, numpy=b'test'>, <tf.Tensor: shape=(), dtype=float32, numpy=0.0>, <tf.Tensor: shape=(), dtype=int32, numpy=3>)
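As for what comes next: before federated training, each client dataset usually needs to be mapped into model-ready batches. Below is a minimal sketch, not part of the original answer, assuming the three test-data columns above with the last column as the integer label; the preprocess function, batch size, and shuffle buffer are illustrative choices.
import collections
import tensorflow as tf

def preprocess(dataset):
  # Turn each parsed CSV row (a string, a float feature, an int label)
  # into the (features, label) structure a model expects, then batch.
  def to_example(name, value, label):
    del name  # drop the string column in this sketch
    return collections.OrderedDict(
        x=tf.reshape(value, [1]),
        y=tf.reshape(label, [1]))
  return dataset.map(to_example).shuffle(100).batch(20)

# One preprocessed tf.data.Dataset per client, ready to hand to a
# federated training process.
train_data = [
    preprocess(source.create_tf_dataset_for_client(cid))
    for cid in source.client_ids
]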
It may be possible to concatenate all the data into a single CSV, but each record would still need some identifier indicating which row belongs to which client. Mixing all the rows together without any kind of per-client mapping would be akin to standard centralized training, not federated learning.
Once a CSV has all the rows, and perhaps a column with a client_id value, one could presumably use tf.data.Dataset.filter() to yield only the rows belonging to a particular client. This probably won't be particularly efficient though, as it would iterate over the entire global dataset for each client rather than only that client's examples. A rough sketch of this approach follows below.
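To make the filter approach concrete, here is a hedged sketch; the combined file path, column order, and client_id column are assumptions for illustration, not from the original answer.
import tensorflow as tf

COMBINED_CSV = '/content/all_clients.csv'  # hypothetical combined file
# Assumed columns: client_id (string), feature (float), label (int).
record_defaults = ['', 0.0, 0]

def create_dataset_for_client(client_id):
  ds = tf.data.experimental.CsvDataset(
      COMBINED_CSV, record_defaults=record_defaults)
  # Keep only this client's rows; note this scans the entire file
  # every time a client dataset is built.
  return ds.filter(lambda cid, value, label: tf.equal(cid, client_id))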