[Podcast] Faster Python with Guido van Rossum

A podcast episode with a full transcript, 31 minutes long.


Utsav: [09:45] Could you walk us through the idea of the tiers of execution of the Python Interpreter?

Guido: When you execute a program, you don’t know if it’s going to crash after running a fraction of a millisecond, or whether it’s going to be a three-week-long computation. Because it could be the same code, just in the first case, it has a bug. And so, if it takes three weeks to run the program, maybe it would make sense to spend half an hour ahead of time optimizing all the code that’s going to be run. But obviously, especially in dynamic languages like Python, where we do as much as we can without asking the user to tell us exactly how they need it done, you just want to start executing code as quickly as you can. So that if it’s a small script, or a large program that happens to fail early, or just exits early for a good reason, you don’t spend any time being distracted by optimizing all that code.

So, what we try to do there is keep the bytecode compiler simple so that we get to execute the beginning of the code as soon as possible. If we see that certain functions are being executed many times over, then we call that a hot function, for some definition of “hot”. For some purposes, maybe it’s a hot function if it gets called more than once, or more than twice, or more than 10 times. For other purposes, you want to be more conservative, and you can say, “Well, it’s only hot if it’s been called 1000 times.”
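The hotness threshold Guido describes can be illustrated with a toy sketch in plain Python (this is not CPython's actual machinery; the `HOT_THRESHOLD` value, the counters, and all names here are invented for illustration):

```python
# Toy model of "hot function" detection: count calls per function and
# promote a function to "hot" once it crosses a configurable threshold.
HOT_THRESHOLD = 10  # hypothetical cutoff; real interpreters tune this internally

call_counts = {}
hot_functions = set()

def record_call(func_name):
    """Bump the call counter and mark the function hot past the threshold."""
    call_counts[func_name] = call_counts.get(func_name, 0) + 1
    if call_counts[func_name] >= HOT_THRESHOLD:
        hot_functions.add(func_name)

for _ in range(12):
    record_call("compute")

print("compute is hot:", "compute" in hot_functions)  # → compute is hot: True
```

A more aggressive tier would use a lower threshold and specialize sooner; a conservative one waits longer to avoid optimizing code that never runs again.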

The specializing adaptive compiler (PEP 659) then tries to replace certain bytecodes with bytecodes that are faster, but only work if the types of the arguments are specific types. A simple hypothetical example is the plus operator in Python. It can add lots of things like integers, strings, lists, or even tuples. On the other hand, you can’t add an integer to a string. So, the optimization step - often called quickening, but usually in our context, we call it specializing - is to have a separate “binary add” integer bytecode, a second-tier bytecode hidden from the user. This opcode assumes that both of its arguments are actual Python integer objects, reaches directly into those objects to find the values, adds those values together in machine registers, and pushes the result back on the stack.

The binary-add-integer operation still has to make a type check on its arguments. So, it’s not completely free, but a type check can be implemented much faster than a completely generic object-oriented dispatch, like what normally happens for most generic add operations.

Finally, it’s always possible that a function is called millions of times with integer arguments, and then suddenly a piece of data calls it with a floating-point argument, or something worse. At that point, the interpreter will simply execute the original bytecode. That’s an important part so that you still have the full Python semantics.
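The specialize / guard / de-optimize cycle described above can be sketched in plain Python (an illustrative model only; CPython implements this at the bytecode level in C under PEP 659, and every name here is made up):

```python
# Toy model of adaptive specialization: a generic "add" handler is swapped
# for an int-specialized one that keeps a cheap type guard and falls back
# (de-optimizes) to the generic path on a guard miss.

def generic_add(a, b):
    # Fully generic path: goes through Python's normal dispatch for "+".
    return a + b

def make_specialized_add(deopt):
    def int_add(a, b):
        # Guard: the fast path is only valid for exact ints.
        if type(a) is int and type(b) is int:
            return a + b  # stands in for the direct machine-level add
        return deopt(a, b)  # guard failed: take the generic fallback
    return int_add

handler = generic_add
# Pretend the interpreter observed many int/int calls and specialized:
handler = make_specialized_add(generic_add)

print(handler(2, 3))    # fast path → 5
print(handler(2.0, 3))  # guard miss, generic fallback → 5.0
```

The key property, as Guido notes, is that the fallback preserves full Python semantics: a guard miss costs a little time but never changes the result.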

Utsav: [18:20] Generally you hear of these techniques in the context of JIT, a Just-In-Time compiler, but that’s not being implemented right now.

Guido: Just-In-Time compilation has a whole bunch of emotional baggage with it at this point that we’re trying to avoid. In our case, it’s unclear what and when we’re exactly compiling. At some point ahead of program execution, we compile your source code into bytecode. Then we translate the bytecode into specialized bytecode. I mean, everything happens at some point during runtime, so which part would you call Just-In-Time?

Also, it’s often assumed that Just-In-Time compilation automatically makes all your code better. Unfortunately, you often can’t actually predict what the performance of your code is going to be. And we have enough of that with modern CPUs and their fantastic branch prediction. For example, we write code in a way that we think will clearly reduce the number of memory accesses. When we benchmark it, we find that it runs just as fast as the old unoptimized code because the CPU figured out access patterns without any of our help. I wish I knew what went on in modern CPUs when it comes to branch prediction and inline caching because that is absolute magic.
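In that spirit, the honest move is to measure rather than predict. A minimal sketch using the standard-library `timeit` module to compare two variants on your own machine (the snippets and iteration count are arbitrary examples):

```python
# Benchmark two loop bodies; the only trustworthy numbers are the ones
# measured on the machine you actually run on.
import timeit

int_add   = timeit.timeit("x + y", setup="x, y = 2, 3", number=100_000)
float_add = timeit.timeit("x + y", setup="x, y = 2.0, 3.0", number=100_000)

print(f"int add:   {int_add:.6f}s")
print(f"float add: {float_add:.6f}s")
```

No expected ranking is shown on purpose: as Guido says, branch prediction and caching in modern CPUs routinely defeat intuition about which variant will win.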




I understand the reasoning, but why does this still sound so much like a JIT?

I saw a very enlightening talk on JIT a while back: Just-in-Time Compilation - JF Bastien - CppCon 2020




Julia is also a dynamic language born at the right moment: LLVM backend support made JIT compilation “simple”. Personally, I think Python could certainly borrow from Julia’s approach. As Stefan Karpinski said in one talk, if Numba had come out a few years earlier, the Julia project might never have existed. However, as dynamic languages, they will eventually run into the same problems Julia did, and it ends up as “if you’re going to go this far, why not just use Julia?”

Both MATLAB and Python now support annotating variable types. That is the first step, and my understanding is that it is still limited to checking rather than compiler optimization. The second step would be using that type information for JIT compilation to speed up computation, as Julia and Numba do. When I first learned about JIT, I naively assumed it worked just like an AOT-compiled language: fix the types, then emit optimized machine code. Only much later did I realize that Julia’s JIT does not work that way, which is also why the Julia community always emphasizes type stability rather than mandatory type annotations. Understanding this was not easy for me, and I believe it is also why some materials say “you need to understand the language to make it fast and robust.” When I first wrote Julia, I habitually wrote things like f(x::Float) = x + 1.0, probably because my earliest exposure was to C/C++/Fortran.
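On the point that Python's type annotations are currently about checking rather than optimization: CPython stores them as metadata on the function and neither enforces nor exploits them at runtime; it is external tools like mypy that read them. A small demonstration:

```python
# Annotations are metadata only: CPython records them but does not
# enforce them, and the interpreter does not use them for optimization.

def increment(x: int) -> int:
    return x + 1

# The annotation survives as data on the function object...
print(increment.__annotations__)

# ...but the interpreter happily accepts a float anyway:
print(increment(1.5))  # → 2.5
```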

On the other hand, the interaction between JIT and Julia’s type system has recently raised some puzzles for me. Often, for robustness, we avoid specifying concrete types and use abstract types, or no annotation at all. But based on my understanding of Tim Holy’s work on precompilation, the cost of doing so is that type inference becomes slower, which shows up in practice as very noticeable package load times and first-time-to-plot latency. I vaguely sense this is a trade-off, but I have not fully understood what the solution is.

