For writing kernels and profiling, yes, it's pretty nice. But the host-side API is rather unapproachable: there are something like 20 API calls just for the different variants of memory allocation, and they're inconsistently named. Also, the APIs are almost entirely (but not quite 100%) C-ish.
In the API wrappers I've written (https://github.com/eyalroz/cuda-api-wrappers), I try to address these and some other issues.
We should also remember that NVIDIA artificially prevents its profiling tools from supporting OpenCL kernels, for no good technical reason.