OMEinsum.jl allocates more memory than TensorOperations.jl

gxliu · March 30, 2023, 03:33

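When the indices have different dimensions, the contraction order of a tensor network has a large impact on performance. When all indices have the same dimension, however, the order should matter very little. While testing this, I found that after swapping two indices of equal dimension, OMEinsum.jl allocates more memory than TensorOperations.jl. The tests are as follows: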

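using BenchmarkTools  # provides @btime
using TensorOperations

function testTensorOperations(dim::Int)
    a = randn(dim, dim)
    b = randn(dim, dim, 4)

    println("This is TensorOperations test: ")
    # Interpolate the local arrays with $ so that @btime can see them.
    @btime @tensor c[x,z,w] := $a[x,y] * $b[y,z,w];  # contract over the second index of a
    @btime @tensor c[x,z,w] := $a[y,x] * $b[y,z,w];  # contract over the first index of a
end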

> testTensorOperations(4000)
This is TensorOperations test: 
  3.058 s (3 allocations: 488.28 MiB)
  3.048 s (3 allocations: 488.28 MiB)

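using BenchmarkTools  # provides @btime
using OMEinsum

function testOMEinsum(dim::Int)
    a = randn(dim, dim)
    b = randn(dim, dim, 4)

    println("This is OMEinsum test: ")
    # Interpolate the local arrays with $ so that @btime can see them.
    @btime @ein A[x, z, w] := $a[x, y] * $b[y, z, w];  # contract over the second index of a
    @btime @ein A[x, z, w] := $a[y, x] * $b[y, z, w];  # contract over the first index of a
end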

> testOMEinsum(4000)
This is OMEinsum test: 
  3.067 s (65 allocations: 488.29 MiB)
  3.091 s (85 allocations: 610.36 MiB)

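This test (and in fact many others like it) confirmed my guess that the difference in run time is small, but it also showed that OMEinsum.jl allocates more memory than TensorOperations.jl. Based on related discussions in the English-language community and this GitHub issue, my guess at the cause is: OMEinsum.jl also has to support GPUs, but GPUArrays does not yet support an in-place version of permutedims, so this optimization was dropped. TensorOperations.jl, by contrast, uses Strided.jl, developed by Jutho himself, to implement these operations more efficiently.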
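As a rough sanity check on the numbers above, the sketch below works out the array sizes involved. It uses only Base Julia and LinearAlgebra and does not inspect OMEinsum's internals; the interpretation that the extra allocation is a permuted copy of `a` is an assumption that merely fits the arithmetic. The helper `mib` is a hypothetical name introduced here.

```julia
using LinearAlgebra

dim = 4000
mib(nbytes) = nbytes / 2^20                      # hypothetical helper: bytes -> MiB

out_mib  = mib(dim * dim * 4 * sizeof(Float64))  # output c[x,z,w]: 488.28125 MiB
copy_mib = mib(dim * dim * sizeof(Float64))      # one dim×dim matrix: 122.0703125 MiB

println("output array:         ", out_mib, " MiB")
println("one copy of a:        ", copy_mib, " MiB")
println("output + one copy:    ", out_mib + copy_mib, " MiB")

# Base Julia offers both a lazy and a materialized way to swap matrix indices.
a = randn(8, 8)
at_lazy = transpose(a)    # wrapper that shares memory with a
at_copy = permutedims(a)  # freshly allocated array with the indices swapped
println(parent(at_lazy) === a)  # the lazy wrapper still points at a
println(at_copy == at_lazy)     # both represent the same matrix
```

The sum, about 610.35 MiB, is close to the 610.36 MiB reported for the second OMEinsum benchmark, which is what one would expect if a transposed copy of `a` were materialized before the multiplication.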

I am not sure whether my understanding is correct, so anyone interested is welcome to join the discussion!

GiggleLiu · March 30, 2023, 06:42

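That is right, OMEinsum gave up type stability. It has nothing to do with GPUs, though. The reason is that OMEinsum's main use case is large-scale tensor network optimization, and its most distinctive feature is contraction order optimization for high-dimensional tensor networks. In that setting the tensors often have very many dimensions, for example a 20-dimensional tensor contracted with another 20-dimensional tensor. There are exponentially many ways to contract tensors of such high dimension, so it is not feasible to compile a different function for each combination; doing so would incur a huge overhead.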
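To make the compilation trade-off concrete, here is a minimal sketch in plain Julia. `StaticSpec` and `DynamicSpec` are hypothetical types invented for this illustration and are not OMEinsum's own types; they only show how index labels stored in type parameters give every contraction pattern its own type (and hence its own compiled specialization), while labels stored in runtime fields share one type.

```julia
# Hypothetical types for illustration only; these are not part of OMEinsum.
struct StaticSpec{IXS, IY} end   # index labels live in the type parameters

struct DynamicSpec               # index labels live in ordinary runtime fields
    ixs::Vector{Vector{Char}}
    iy::Vector{Char}
end

# Two contraction patterns that differ only in the index order of the first tensor.
s1 = StaticSpec{(('x', 'y'), ('y', 'z')), ('x', 'z')}()
s2 = StaticSpec{(('y', 'x'), ('y', 'z')), ('x', 'z')}()
d1 = DynamicSpec([['x', 'y'], ['y', 'z']], ['x', 'z'])
d2 = DynamicSpec([['y', 'x'], ['y', 'z']], ['x', 'z'])

println(typeof(s1) == typeof(s2))  # different types: one specialization per pattern
println(typeof(d1) == typeof(d2))  # same type: one compiled method serves all patterns

# Just choosing which of 20 indices are contracted already gives 2^20 subsets,
# before counting any reordering of the remaining indices.
println(2^20)
```

With the static encoding, each new pattern encountered during a large contraction would trigger fresh compilation; with the dynamic encoding, the pattern is ordinary data and the cost moves to run time.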

The resulting problem is that Julia cannot know the output type at compile time. I do not have a good solution for this at the moment.
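The following sketch illustrates what "unknown output type at compile time" means, using only Base Julia. The functions `dynamic_out` and `static_out` are hypothetical names made up for this example and do not come from OMEinsum; they stand in for an operation whose output dimensionality is decided at run time versus one where it is encoded in the argument type.

```julia
# Hypothetical functions for illustration only.
# The number of dimensions is a runtime value, so the compiler cannot pin down
# the N in Array{Float64, N}.
dynamic_out(n::Int) = zeros(ntuple(_ -> 2, n)...)

# The number of dimensions is part of the argument type, so the result type can
# be inferred, at the price of one compiled specialization per N.
static_out(::Val{N}) where {N} = zeros(ntuple(_ -> 2, Val(N))...)

rt_dynamic = only(Base.return_types(dynamic_out, (Int,)))
rt_static  = only(Base.return_types(static_out, (Val{3},)))

println(isconcretetype(rt_dynamic))  # inferred return type of the dynamic version
println(isconcretetype(rt_static))   # inferred return type of the static version
println(ndims(dynamic_out(3)))       # the runtime result is still a 3-dimensional array
```

Code that consumes the result of the dynamic version has to go through dynamic dispatch, which costs little for large tensors but can be noticeable for small ones.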

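My advice to users: if reducing static compilation overhead matters for your problem, use OMEinsum. If you want better type stability and better performance on small matrices, use TensorOperations.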