Is im2col possible?

Is the `Tensor` type suited to implementing an `im2col` operation? I tried, but only got it working with nested loops, which is of course a poor fit for CUDA.

In the end, I want to arrive at an efficient convolution. Would that be possible with the current API surface?
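
To make the question concrete, here is a minimal sketch of the loop-free formulation I am after. It is written in NumPy purely as a stand-in, not in this library's API: the function names `im2col` and `conv2d`, the `(C, H, W)` layout, and the lack of padding are all assumptions for illustration. The idea is that `im2col` reduces to one gather with precomputed index arrays, and the convolution then becomes a single matrix multiplication.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Gather all kh x kw patches of x (C, H, W) into a (C*kh*kw, P) matrix.

    Hypothetical helper: no padding, P = out_h * out_w.
    """
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1

    # Offsets of each element inside one patch, shape (kh*kw,)
    ki = np.repeat(np.arange(kh), kw)
    kj = np.tile(np.arange(kw), kh)
    # Top-left corner of every patch, shape (out_h*out_w,)
    oi = stride * np.repeat(np.arange(out_h), out_w)
    oj = stride * np.tile(np.arange(out_w), out_h)

    # Broadcast to full index grids, shape (kh*kw, out_h*out_w)
    rows = ki[:, None] + oi[None, :]
    cols = kj[:, None] + oj[None, :]

    # One gather instead of nested loops: result is (C, kh*kw, P)
    patches = x[:, rows, cols]
    return patches.reshape(c * kh * kw, out_h * out_w)

def conv2d(x, weight, stride=1):
    """Cross-correlation of x (C, H, W) with weight (F, C, kh, kw), no padding."""
    f, c, kh, kw = weight.shape
    out_h = (x.shape[1] - kh) // stride + 1
    out_w = (x.shape[2] - kw) // stride + 1
    cols = im2col(x, kh, kw, stride)
    # The convolution itself is a single matrix product
    return (weight.reshape(f, -1) @ cols).reshape(f, out_h, out_w)
```

So the question boils down to whether `Tensor` exposes something equivalent to this kind of index-based gather (or strided views) plus a matrix multiply; if it does, the nested loops should not be needed.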