If you double-click the Core ML file on a Mac it opens in Xcode, and there is a profiler you can run. The profiler will show you the operations the model uses and what bit depth each one runs at.
On the Core ML side this is likely because the Neural Engine works in fp16, and offloading some or all layers to the ANE significantly reduces inference time and power usage when running models. You can inspect the model in the Xcode profiler to see what is running on each part of the device, and at what precision.
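If you want to control this at conversion time rather than just observe it in the profiler, coremltools exposes compute_precision and compute_units. A minimal sketch, with a stand-in PyTorch module (substitute your own model and input shapes):

```python
import coremltools as ct
import torch

# Tiny stand-in model; replace with your own module and example input.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_input = torch.rand(1, 16)
traced = torch.jit.trace(model, example_input)

# fp32, CPU-only: nothing gets silently cast to fp16 for the ANE.
mlmodel_fp32 = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example_input.shape)],
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT32,
    compute_units=ct.ComputeUnit.CPU_ONLY,
)

# Default-ish path: fp16 weights, any compute unit (CPU/GPU/ANE) allowed.
mlmodel_fp16 = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example_input.shape)],
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT16,
    compute_units=ct.ComputeUnit.ALL,
)

mlmodel_fp32.save("model_fp32.mlpackage")
mlmodel_fp16.save("model_fp16.mlpackage")
```

Opening the two .mlpackage files in Xcode's performance report should then show the fp32 model pinned to the CPU and the fp16 one free to schedule onto the GPU/ANE.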
While this is a bit too harsh - and the solution is naive at best - the problem is real.
The idea of bitwise reproducibility for floating point computations is completely laughable in any part of the DL landscape. Meanwhile, in just about every other area that uses fp computation, it's been the de facto standard for decades.
From NVidia not guaranteeing bitwise reproducibility even on the same GPU: https://docs.nvidia.com/deeplearning/cudnn/backend/v9.17.0/d...
To the frameworks somehow being even worse, where the best you can do is order them by how bad they are - with TensorFlow far down at the bottom and JAX (currently) at the top - and try to use the least bad one.
This is a huge issue for anyone serious about developing novel models, and I see no one talking about it, let alone trying to solve it.
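For concreteness, this is roughly what "the best you can do" looks like in PyTorch today - a sketch of the opt-in determinism flags, which narrow the problem rather than solve it:

```python
import torch

# Opt in to deterministic kernel selection; ops without a deterministic
# implementation will raise instead of silently giving run-to-run drift.
torch.use_deterministic_algorithms(True)

# cuDNN: disable autotuning (benchmark can pick different algorithms per run)
# and ask for the deterministic algorithm variants where they exist.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# Fixed seeds only remove one source of drift; kernel choice and reduction
# order are what the flags above are for. On CUDA, deterministic matmuls
# additionally need the CUBLAS_WORKSPACE_CONFIG env var to be set.
torch.manual_seed(0)
```

Even with all of that set, results are at best reproducible for a fixed GPU model, library version, and build - not across them.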
> Meanwhile, in just about every other area that uses fp computation, it's been the de facto standard for decades.
Not that strongly for more parallel workloads - it's quite similar to the situation with atomics in cuDNN. cuBLAS, for example, has a similar issue with multi-stream handling, though that one can be overcome with a proper workspace allocation: https://docs.nvidia.com/cuda/cublas/index.html?highlight=Rep....
Still better than cuDNN, where some operations just don't have a reproducible version at all. The other fields are at least trying; DL doesn't seem to be.
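For the multi-stream cuBLAS case specifically, the documented knob is a fixed per-stream workspace via the CUBLAS_WORKSPACE_CONFIG environment variable, which has to be set before the first CUDA call. A sketch through PyTorch, assuming a CUDA machine:

```python
import os

# Must be set before the first CUDA/cuBLAS call in the process.
# ":4096:8" (or ":16:8") gives each stream its own fixed workspace, which is
# what makes cuBLAS results reproducible when a handle is used with >1 stream.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch  # imported after setting the env var so cuBLAS sees it

torch.use_deterministic_algorithms(True)

a = torch.randn(512, 512, device="cuda")
b = torch.randn(512, 512, device="cuda")
c1 = a @ b
c2 = a @ b
# Bitwise-identical within this process/configuration; no promise is made
# across GPU models or library versions.
assert torch.equal(c1, c2)
```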
> The idea of bitwise reproducibility for floating point computations is completely laughable in any part of the DL landscape. Meanwhile, in just about every other area that uses fp computation, it's been the de facto standard for decades.
It is quite annoying when you parallelize, and I don't know that many people ever cared about bitwise reproducibility, especially when it requires giving up a bit of performance.
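A tiny NumPy illustration of why parallelization and bitwise reproducibility fight each other - the same numbers summed in three different orders typically land on three slightly different float32 results:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

# Same data, three different reduction orders.
left_to_right = np.float32(0.0)
for v in x:
    left_to_right += v                                  # strict serial order

pairwise = np.sum(x)                                    # NumPy's pairwise reduction
chunked = np.sum(x.reshape(100, 1000), axis=1).sum()    # "per-thread" partial sums

print(left_to_right, pairwise, chunked)
# Typically three slightly different answers: fp addition is not associative,
# so any change in reduction order (thread count, tiling, atomics) can change
# the bitwise result even though all three are "correct" to within rounding.
```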
On that note, Intel added reproducible BLAS to oneMKL on CPU and GPU last year: https://www.intel.com/content/www/us/en/developer/archive/tr...
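For the CPU side, the long-standing knob is MKL's Conditional Numerical Reproducibility (CNR) setting; a sketch assuming a NumPy build that is linked against MKL (the GPU-side controls from the linked article aren't shown here):

```python
import os

# Conditional Numerical Reproducibility: pin the code path so results do not
# change with the host's ISA, and STRICT makes them independent of the number
# of threads as well. Must be set before MKL is loaded.
os.environ["MKL_CBWR"] = "AVX2,STRICT"   # or "COMPATIBLE" for the safest path

import numpy as np  # assumes this NumPy build is linked against MKL

a = np.random.default_rng(0).standard_normal((512, 512))
b = np.random.default_rng(1).standard_normal((512, 512))
print((a @ b).sum())  # should now be bitwise-stable across runs and thread counts
```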