Dataset gets re-copied to GPU (causing out of memory) when calling evaluate twice

Here is my code:

import gc
import tensorflow as tf

# I train a model, save it and then clear everything with:
del model
tf.keras.backend.clear_session()
gc.collect()
print(f"memory usage {tf.config.experimental.get_memory_info('GPU:0')['current'] / 10 ** 9} GB")
checkpoint_model = open_saved_model()    # returns a tf.keras.Model
print(f"memory usage {tf.config.experimental.get_memory_info('GPU:0')['current'] / 10 ** 9} GB")
eval_result = checkpoint_model.evaluate(train_ds[0], train_ds[1], batch_size=30)
print(f"memory usage {tf.config.experimental.get_memory_info('GPU:0')['current'] / 10 ** 9} GB")
# The second evaluate is where the out-of-memory error occurs
eval_result = checkpoint_model.evaluate(train_ds[0], train_ds[1], batch_size=30)

The memory outputs are:

memory usage 0.0 GB
memory usage 0.013005312 GB
memory usage 5.893292544 GB

And on the last line (the second evaluate) I get tensorflow.python.framework.errors_impl.InternalError (full message at the end).

My train dataset should occupy train_ds[0].size * train_ds[0].itemsize / 10**9 = 4.395368448 GB.

According to nvidia-smi, my GPU has 10481 MiB available out of 11016 MiB. If I add the used memory plus the numpy array I get 10.27146624 GB, which is borderline against the ~10.48 GB that TensorFlow decided to allocate. What's more, although it reserved about 10 GB, there is a message (see the full error message at the end) saying the device was created with only 8965 MB of memory (weird, but it explains why I run out of memory).

Regardless of this borderline result, it seems wrong that the dataset is allocated AGAIN. TensorFlow should either re-use the copy already on the GPU from the first evaluate or replace it with the new one.

I tried using train_dataset = tf.data.Dataset.from_tensor_slices((train_ds[0], train_ds[1])).batch(32) and the MWE worked (with memory usage rising to 7.35 GB), but if I replace the second evaluate with a predict (which is my actual goal) I get the same error. A sketch of that variant is below.
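For reference, this is roughly what that attempt looks like (a sketch only, reusing open_saved_model() and the numpy pair train_ds from above):

train_dataset = tf.data.Dataset.from_tensor_slices((train_ds[0], train_ds[1])).batch(32)

checkpoint_model = open_saved_model()
eval_result = checkpoint_model.evaluate(train_dataset)  # works, memory rises to ~7.35 GB
eval_result = checkpoint_model.evaluate(train_dataset)  # a second evaluate also works
prediction = checkpoint_model.predict(train_dataset)    # replacing the second evaluate with predict hits the same InternalError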


I read about setting os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async", but with it I just get Process finished with exit code 139 (interrupted by signal 11: SIGSEGV) without any error message. Comparing this run to the others, it stops before the message Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8965 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5, so I think it fails to "create" the device.
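If it helps, the allocator variable presumably has to be set before TensorFlow initializes the GPU, so I set it before the import; something like this sketch:

import os
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"  # must be set before TF touches the GPU

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))  # roughly where the run dies with SIGSEGV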

MOTIVATION

This is the MWE I managed to create, but in truth I want to evaluate and predict on MANY datasets of roughly 5 GB each. The current workaround for me would be (see the sketch after the list):

  1. Clear all GPU
  2. Load model
  3. Evaluate
  4. Clear all GPU
  5. Load model again
  6. Predict

And then repeat steps 1 to 6 for each of my several datasets (highly inefficient, right?).
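In code, that workaround would look roughly like this (a sketch only; load_datasets() is a hypothetical placeholder for my own data loading, and open_saved_model() is as above):

for train_ds in load_datasets():              # hypothetical iterator over my ~5 GB datasets
    tf.keras.backend.clear_session()          # 1. clear all GPU
    gc.collect()
    checkpoint_model = open_saved_model()     # 2. load model
    checkpoint_model.evaluate(train_ds[0], train_ds[1], batch_size=30)  # 3. evaluate

    del checkpoint_model                      # 4. clear all GPU again
    tf.keras.backend.clear_session()
    gc.collect()
    checkpoint_model = open_saved_model()     # 5. load model again
    checkpoint_model.predict(train_ds[0])     # 6. predict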

Full error message

2022-04-06 13:24:49.708029: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:49.713198: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:49.713526: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:49.713988: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-04-06 13:24:49.714414: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:49.714715: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:49.715002: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:50.044152: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:50.044479: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:50.044766: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-06 13:24:50.045036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8965 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5
memory usage 0.0 GB
2022-04-06 13:25:00.250155: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2022-04-06 13:25:00.250170: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2022-04-06 13:25:00.250192: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 1 GPUs
2022-04-06 13:25:00.250349: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcupti.so.11.2'; dlerror: libcupti.so.11.2: cannot open shared object file: No such file or directory
2022-04-06 13:25:00.356485: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2022-04-06 13:25:00.356639: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
2022-04-06 13:25:00.372969: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 4396941312 exceeds 10% of free system memory.
2022-04-06 13:25:03.200488: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 4396941312 exceeds 10% of free system memory.
/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/keras/utils/generic_utils.py:494: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
  warnings.warn('Custom mask layers require a config and must override '
2022-04-06 13:25:05.075473: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2022-04-06 13:25:07.796065: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8303
2022-04-06 13:25:08.177722: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2022-04-06 13:25:08.177947: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2022-04-06 13:25:08.177972: W tensorflow/stream_executor/gpu/asm_compiler.cc:77] Couldn't get ptxas version string: Internal: Couldn't invoke ptxas --version
2022-04-06 13:25:08.178231: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2022-04-06 13:25:08.178262: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: Failed to launch ptxas
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
  1/187 [..............................] - ETA: 11:00 - loss: 0.8855 - accuracy: 0.3187 - average_accuracy: 0.2666 - precision: 0.3264 - recall: 0.00222022-04-06 13:25:08.716360: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2022-04-06 13:25:08.716379: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
  2/187 [..............................] - ETA: 1:05 - loss: 0.8388 - accuracy: 0.3063 - average_accuracy: 0.2759 - precision: 0.3169 - recall: 0.0049 2022-04-06 13:25:09.011233: I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
2022-04-06 13:25:09.011432: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
2022-04-06 13:25:09.040157: I tensorflow/core/profiler/internal/gpu/cupti_collector.cc:673]  GpuTracer has collected 705 callback api events and 707 activity events. 
2022-04-06 13:25:09.049283: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2022-04-06 13:25:09.061327: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09

2022-04-06 13:25:09.071522: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for trace.json.gz to log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09/barrachina-SONDRA.trace.json.gz
2022-04-06 13:25:09.096291: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09

2022-04-06 13:25:09.101018: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for memory_profile.json.gz to log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09/barrachina-SONDRA.memory_profile.json.gz
2022-04-06 13:25:09.101899: I tensorflow/core/profiler/rpc/client/capture_profile.cc:251] Creating directory: log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09
Dumped tool data for xplane.pb to log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09/barrachina-SONDRA.xplane.pb
Dumped tool data for overview_page.pb to log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09/barrachina-SONDRA.overview_page.pb
Dumped tool data for input_pipeline.pb to log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09/barrachina-SONDRA.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09/barrachina-SONDRA.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to log/2022/04April/06Wednesday/run-13h24m42/tensorboard/train/plugins/profile/2022_04_06_13_25_09/barrachina-SONDRA.kernel_stats.pb

187/187 [==============================] - 10s 37ms/step - loss: 0.8277 - accuracy: 0.5412 - average_accuracy: 0.3043 - precision: 0.5026 - recall: 0.0087 - val_loss: 0.8309 - val_accuracy: 0.6880 - val_average_accuracy: 0.2931 - val_precision: 0.6810 - val_recall: 0.0047
memory usage 6.042584576 GB
memory usage 0.006478336 GB
2022-04-06 13:25:16.022531: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 4396941312 exceeds 10% of free system memory.
memory usage 0.012938752 GB
2022-04-06 13:25:18.885690: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 4396941312 exceeds 10% of free system memory.
187/187 [==============================] - 4s 16ms/step - loss: 0.8138 - accuracy: 0.6999 - average_accuracy: 0.2968 - precision: 0.6710 - recall: 0.0058
memory usage 5.90003712 GB
2022-04-06 13:25:24.458057: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 4396941312 exceeds 10% of free system memory.
2022-04-06 13:25:35.851249: W tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.09GiB (rounded to 4396941312)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2022-04-06 13:25:35.851336: I tensorflow/core/common_runtime/bfc_allocator.cc:1004] BFCAllocator dump for GPU_0_bfc
2022-04-06 13:25:35.851375: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (256):  Total Chunks: 263, Chunks in use: 263. 65.8KiB allocated for chunks. 65.8KiB in use in bin. 15.2KiB client-requested in use in bin.
2022-04-06 13:25:35.851405: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (512):  Total Chunks: 71, Chunks in use: 70. 42.2KiB allocated for chunks. 41.8KiB in use in bin. 36.0KiB client-requested in use in bin.
2022-04-06 13:25:35.851432: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (1024):     Total Chunks: 10, Chunks in use: 9. 15.0KiB allocated for chunks. 14.0KiB in use in bin. 12.6KiB client-requested in use in bin.
2022-04-06 13:25:35.851456: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (2048):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851483: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (4096):     Total Chunks: 6, Chunks in use: 6. 31.5KiB allocated for chunks. 31.5KiB in use in bin. 30.4KiB client-requested in use in bin.
2022-04-06 13:25:35.851511: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (8192):     Total Chunks: 12, Chunks in use: 12. 123.0KiB allocated for chunks. 123.0KiB in use in bin. 121.5KiB client-requested in use in bin.
2022-04-06 13:25:35.851534: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (16384):    Total Chunks: 1, Chunks in use: 0. 30.2KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851560: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (32768):    Total Chunks: 13, Chunks in use: 11. 579.0KiB allocated for chunks. 475.5KiB in use in bin. 445.5KiB client-requested in use in bin.
2022-04-06 13:25:35.851586: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (65536):    Total Chunks: 1, Chunks in use: 1. 73.8KiB allocated for chunks. 73.8KiB in use in bin. 40.5KiB client-requested in use in bin.
2022-04-06 13:25:35.851610: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (131072):   Total Chunks: 18, Chunks in use: 18. 2.85MiB allocated for chunks. 2.85MiB in use in bin. 2.72MiB client-requested in use in bin.
2022-04-06 13:25:35.851634: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (262144):   Total Chunks: 2, Chunks in use: 1. 769.5KiB allocated for chunks. 283.5KiB in use in bin. 162.0KiB client-requested in use in bin.
2022-04-06 13:25:35.851658: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (524288):   Total Chunks: 11, Chunks in use: 10. 6.96MiB allocated for chunks. 6.33MiB in use in bin. 6.33MiB client-requested in use in bin.
2022-04-06 13:25:35.851682: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (1048576):  Total Chunks: 2, Chunks in use: 2. 2.25MiB allocated for chunks. 2.25MiB in use in bin. 1.27MiB client-requested in use in bin.
2022-04-06 13:25:35.851704: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (2097152):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851725: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (4194304):  Total Chunks: 1, Chunks in use: 0. 4.43MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851769: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (8388608):  Total Chunks: 1, Chunks in use: 0. 12.44MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851799: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (16777216):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851821: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (33554432):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851841: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (67108864):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851865: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (134217728):    Total Chunks: 1, Chunks in use: 0. 128.08MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-04-06 13:25:35.851889: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (268435456):    Total Chunks: 3, Chunks in use: 2. 8.60GiB allocated for chunks. 5.48GiB in use in bin. 5.46GiB client-requested in use in bin.
2022-04-06 13:25:35.851911: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Bin for 4.09GiB was 256.00MiB, Chunk State: 
2022-04-06 13:25:35.851941: I tensorflow/core/common_runtime/bfc_allocator.cc:1033]   Size: 3.12GiB | Requested Size: 1.97MiB | in_use: 0 | bin_num: 20, prev:   Size: 512B | Requested Size: 384B | in_use: 1 | bin_num: -1
2022-04-06 13:25:35.851960: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Next region of size 9401270272
2022-04-06 13:25:35.851981: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000000 of size 256 next 4
2022-04-06 13:25:35.851999: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000100 of size 256 next 6
2022-04-06 13:25:35.852016: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000200 of size 256 next 3
2022-04-06 13:25:35.852032: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000300 of size 256 next 5
2022-04-06 13:25:35.852048: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000400 of size 256 next 9
2022-04-06 13:25:35.852064: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000500 of size 256 next 7
2022-04-06 13:25:35.852080: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000600 of size 256 next 8
2022-04-06 13:25:35.852097: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000700 of size 256 next 10
2022-04-06 13:25:35.852113: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000800 of size 256 next 13
2022-04-06 13:25:35.852128: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000900 of size 256 next 14
2022-04-06 13:25:35.852144: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000a00 of size 256 next 15
2022-04-06 13:25:35.852159: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000b00 of size 256 next 83
2022-04-06 13:25:35.852174: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000c00 of size 256 next 17
2022-04-06 13:25:35.852189: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000d00 of size 256 next 18
2022-04-06 13:25:35.852204: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fc942000e00 of size 256 next 21
.... Many more messages like this; Stack Overflow limits the maximum number of characters, so I cropped them.
2022-04-06 13:25:35.858545: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7fcaaace5700 of size 512 next 320
2022-04-06 13:25:35.858561: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] Free  at 7fcaaace5900 of size 3347949312 next 18446744073709551615
2022-04-06 13:25:35.858576: I tensorflow/core/common_runtime/bfc_allocator.cc:1065]      Summary of in-use Chunks by size: 
2022-04-06 13:25:35.858600: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 263 Chunks of size 256 totalling 65.8KiB
2022-04-06 13:25:35.858620: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 43 Chunks of size 512 totalling 21.5KiB
2022-04-06 13:25:35.858639: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 27 Chunks of size 768 totalling 20.2KiB
2022-04-06 13:25:35.858658: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 1024 totalling 1.0KiB
2022-04-06 13:25:35.858675: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 2 Chunks of size 1280 totalling 2.5KiB
2022-04-06 13:25:35.858694: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 6 Chunks of size 1792 totalling 10.5KiB
2022-04-06 13:25:35.858712: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 6 Chunks of size 5376 totalling 31.5KiB
2022-04-06 13:25:35.858732: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 12 Chunks of size 10496 totalling 123.0KiB
2022-04-06 13:25:35.858751: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 9 Chunks of size 41472 totalling 364.5KiB
2022-04-06 13:25:35.858770: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 51456 totalling 50.2KiB
2022-04-06 13:25:35.858789: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 62208 totalling 60.8KiB
2022-04-06 13:25:35.858807: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 75520 totalling 73.8KiB
2022-04-06 13:25:35.858826: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 5 Chunks of size 147456 totalling 720.0KiB
2022-04-06 13:25:35.858845: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 11 Chunks of size 165888 totalling 1.74MiB
2022-04-06 13:25:35.858863: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 176640 totalling 172.5KiB
2022-04-06 13:25:35.858882: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 248832 totalling 243.0KiB
2022-04-06 13:25:35.858901: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 290304 totalling 283.5KiB
2022-04-06 13:25:35.858919: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 10 Chunks of size 663552 totalling 6.33MiB
2022-04-06 13:25:35.858937: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 2 Chunks of size 1179648 totalling 2.25MiB
2022-04-06 13:25:35.858955: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 1489978112 totalling 1.39GiB
2022-04-06 13:25:35.858973: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 4396941312 totalling 4.09GiB
2022-04-06 13:25:35.858991: I tensorflow/core/common_runtime/bfc_allocator.cc:1072] Sum Total of in-use chunks: 5.49GiB
2022-04-06 13:25:35.859009: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] total_region_allocated_bytes_: 9401270272 memory_limit_: 9401270272 available bytes: 0 curr_region_allocation_bytes_: 18802540544
2022-04-06 13:25:35.859036: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] Stats: 
Limit:                      9401270272
InUse:                      5900037120
MaxInUse:                   6431716864
NumAllocs:                      165083
MaxAllocSize:               4396941312
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2022-04-06 13:25:35.859127: W tensorflow/core/common_runtime/bfc_allocator.cc:468] *****************************************************************___________________________________
Traceback (most recent call last):
  File "/home/barrachina/Documents/onera/PolSar/principal_simulation.py", line 524, in <module>
    run_wrapper(model_name=args.model[0], balance=args.balance[0], tensorflow=args.tensorflow,
  File "/home/barrachina/Documents/onera/PolSar/principal_simulation.py", line 504, in run_wrapper
    df, dataset_handler, eval_df = run_model(model_name=model_name, balance=balance, tensorflow=tensorflow,
  File "/home/barrachina/Documents/onera/PolSar/principal_simulation.py", line 440, in run_model
    prediction_result = checkpoint_model.predict(train_ds[0])
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/keras/engine/training.py", line 1720, in predict
    data_handler = data_adapter.get_data_handler(
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/keras/engine/data_adapter.py", line 1383, in get_data_handler
    return DataHandler(*args, **kwargs)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/keras/engine/data_adapter.py", line 1138, in __init__
    self._adapter = adapter_cls(
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/keras/engine/data_adapter.py", line 230, in __init__
    x, y, sample_weights = _process_tensorlike((x, y, sample_weights))
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/keras/engine/data_adapter.py", line 1031, in _process_tensorlike
    inputs = tf.nest.map_structure(_convert_numpy_and_scipy, inputs)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/util/nest.py", line 869, in map_structure
    structure[0], [func(*x) for x in entries],
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/util/nest.py", line 869, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/keras/engine/data_adapter.py", line 1026, in _convert_numpy_and_scipy
    return tf.convert_to_tensor(x, dtype=dtype)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1430, in convert_to_tensor_v2_with_dispatch
    return convert_to_tensor_v2(
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1436, in convert_to_tensor_v2
    return convert_to_tensor(
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
    return func(*args, **kwargs)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 271, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 308, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/home/barrachina/anaconda3/envs/tf-pip/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 106, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

Answer:

This issue is probably related to this.

A solution for me (still super weird) is to do:

train_x = tf.convert_to_tensor(train_ds[0])

and use train_x instead of train_ds[0].

Now the weird part: if I also do train_y = tf.convert_to_tensor(train_ds[1]), it does not work; I only need to convert train_ds[0], and only that one. By "work" I mean that I can call evaluate and predict without clearing everything in between. See the sketch below.
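Roughly, the pattern that works for me looks like this (a sketch; open_saved_model() and train_ds come from the question):

train_x = tf.convert_to_tensor(train_ds[0])  # convert only the inputs, not the labels

checkpoint_model = open_saved_model()
eval_result = checkpoint_model.evaluate(train_x, train_ds[1], batch_size=30)
prediction = checkpoint_model.predict(train_x)  # no need to clear the GPU in between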

Answer:

train_ds should be a tf.data.Dataset, or at least a tf.Tensor. If it is a numpy array, a list, or a pandas data structure, you don't get TensorFlow's full performance and optimizations for things like memory allocation.
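For example, something along these lines (a sketch only, reusing the names from the question):

# Build the dataset once; TensorFlow then manages how batches reach the GPU.
train_dataset = tf.data.Dataset.from_tensor_slices((train_ds[0], train_ds[1])).batch(30)

checkpoint_model = open_saved_model()
eval_result = checkpoint_model.evaluate(train_dataset)
prediction = checkpoint_model.predict(train_dataset.map(lambda x, y: x))  # predict only needs the inputs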
