The Corbeille of Nefta (La Corbeille de Nefta) is a unique natural and historical site located in the oasis town of Nefta in southern Tunisia, near the Algerian border. The site is known for its stunning beauty, and its natural springs water nearly half a million date palms, one of the largest…
Thoughtful stories for thoughtless times. Longreads has published hundreds of original stories—personal essays, reported features, reading lists, and more—and more than 13,000 editor’s picks. And they’re all funded by readers like you. Become a member today. In today’s…
New to me: a couple decades ago, author Kurt Vonnegut delivered a lecture on the shape of stories. He uses a diagrammatic line chart to illustrate. The y-axis represents a range from ill fortune to good fortune, and the x-axis represents the beginning to the end of a story. Vonnegut identifies four stories…
What It Means: To collaborate is to work with another person or group in order to do or achieve something. Collaborate can also be used disapprovingly to mean "to cooperate with or willingly assist an enemy of one's country and especially an enemy who occupies it during a war." // Several research…
Algorithmica is an open-access web book dedicated to the art and science of computing. It is created by Sergey Slotin and the teachers and students of Tinkoff Generation — a nonprofit educational organization that trains about half of the finalists of the Russian Olympiad in Informatics. The English…
This is an upcoming high-performance computing book titled “Algorithms for Modern Hardware” by Sergey Slotin. Its intended audience is everyone from performance engineers and practical algorithm researchers to undergraduate computer science students who have just finished an advanced algorithms…
If you ever opened a computer science textbook, it probably introduced computational complexity somewhere in the very beginning. Simply put, it is the total count of elementary operations (additions, multiplications, reads, writes…) that are executed during a computation, optionally weighted by…
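A minimal sketch of this way of counting (the example is mine, not from the excerpt): summing an n-element array costs roughly n reads and n additions, so its complexity is linear in n.

```
// Counting elementary operations: summing an n-element array performs
// about n memory reads and n additions, so its complexity is O(n).
int sum(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)  // n iterations
        s += a[i];               // 1 read + 1 add per iteration
    return s;
}
```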
The main disadvantage of the supercomputers of the 1960s wasn’t that they were slow — relatively speaking, they weren’t — but that they were giant, complex to use, and so expensive that only the governments of the world superpowers could afford them. Their size was the reason they were so expensive:…
If you are reading this book, then somewhere on your computer science journey you had a moment when you first started to care about the efficiency of your code. Mine was in high school, when I realized that making websites and doing useful programming won’t get you into a university, and entered the…
When I began learning how to optimize programs myself, one big mistake I made was to rely primarily on the empirical approach. Not understanding how computers really worked, I would semi-randomly swap nested loops, rearrange arithmetic, combine branch conditions, inline functions by hand, and follow…
As software engineers, we absolutely love building and using abstractions. Just imagine how much stuff happens when you load a URL. You type something on a keyboard; key presses are somehow detected by the OS and get sent to the browser; the browser parses the URL and asks the OS to make a network…
CPUs are controlled with machine language, which is just a stream of binary-encoded instructions that specify the instruction number (called opcode), what its operands are (if there are any), and where to store the result (if one is produced). A much more human-friendly rendition of machine language,…
Let’s consider a slightly more complex example: a loop that calculates the sum of a 32-bit integer array, just as a simple for loop would. The “body” of the loop is add edx, DWORD PTR [rax]: this instruction loads data from the iterator rax and adds it to the accumulator edx. Next, we move the iterator 4…
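A hedged reconstruction of that loop: the excerpt quotes only add edx, DWORD PTR [rax], so the surrounding instructions in the comments are assumptions pieced together from its description.

```
// The C++ loop the excerpt describes; the x86 assembly in the comments
// is reconstructed from the quoted instruction and the prose around it.
int sum(const int *a, const int *end) {
    int s = 0;                   // xor  edx, edx
    while (a != end) {           // loop:
        s += *a;                 //   add edx, DWORD PTR [rax]  (load + add)
        a++;                     //   add rax, 4                (advance 4 bytes)
    }                            //   cmp rax, rcx / jne loop
    return s;
}
```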
Computer engineers like to mentally split the pipeline of a CPU into two parts: the front-end, where instructions are fetched from memory and decoded, and the back-end, where they are scheduled and finally executed. Typically, the performance is bottlenecked by the execution stage, and for this…
To “call a function” in assembly, you need to jump to its beginning and then jump back. But then two important problems arise: What if the caller stores data in the same registers as the callee? Where is “back”? Both of these concerns can be solved by having a dedicated location in memory where we…
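A small sketch of the convention hinted at here: the stack stores both the return address (pushed by call, popped by ret) and any registers the callee needs to reuse. The assembly in the comments is illustrative and assumes the System V x86-64 ABI.

```
// Illustrating call/ret: `call` pushes the return address onto the stack,
// `ret` pops it and jumps back. (Asm comments are illustrative only.)
int square(int x) {     // x arrives in edi (System V ABI)
    return x * x;       // imul edi, edi / mov eax, edi / ret
}

int caller() {
    int r = square(5);  // call square  -> pushes address of the next instruction
    return r + 1;       // execution resumes here after `ret`
}
```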
During assembly, all labels are converted to addresses (absolute or relative) and then encoded into jump instructions. You can also jump by a non-constant value stored inside a register, which is called a computed jump. This has a few interesting applications related to dynamic languages and…
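A minimal sketch of a computed jump using the GCC/Clang labels-as-values extension, the dispatch technique interpreters of dynamic languages often rely on. The two-opcode “VM” here is hypothetical, invented for illustration.

```
#include <cstdio>

// A toy bytecode interpreter: opcode 0 increments, opcode 1 decrements.
int run(const int *code, int n) {
    static void *dispatch[] = {&&op_inc, &&op_dec};  // jump table of label addresses
    int acc = 0;
    for (int i = 0; i < n; i++) {
        goto *dispatch[code[i]];  // computed jump: the target comes from a register
    op_inc:
        acc++;
        continue;
    op_dec:
        acc--;
        continue;
    }
    return acc;
}

int main() {
    int program[] = {0, 0, 1};        // inc, inc, dec
    printf("%d\n", run(program, 3));  // prints 1
}
```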
When programmers hear the word parallelism, they mostly think about multi-core parallelism, the practice of explicitly splitting a computation into semi-independent threads that work together to solve a common problem. This type of parallelism is mainly about reducing latency and achieving…
Pipelining lets you hide the latencies of instructions by running them concurrently, but also creates some potential obstacles of its own — characteristically called pipeline hazards, that is, situations when the next instruction cannot execute on the following clock cycle. There are multiple ways…
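A compact illustration (example mine) of two classic kinds of hazard: a data hazard, where an instruction needs a result that is not ready yet, and a control hazard, where the CPU does not yet know which instruction to fetch next.

```
int hazards(int a, int b, const int *table) {
    int x = a * b;     // the multiply takes several cycles...
    int y = x + 1;     // ...data hazard: this add must wait for x
    if (y % 2 == 0)    // control hazard: the next instruction to fetch
        return table[0];  // depends on a result that isn't computed yet
    return table[1];
}
```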
When a CPU encounters a conditional jump or any other type of branching, it doesn’t just sit idle until its condition is computed — instead, it immediately starts speculatively executing the branch that seems more likely to be taken. During execution, the CPU computes statistics about branches taken…
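The classic demonstration of this effect, sketched from memory: a branch over random values is hard to predict, while the same branch over sorted data follows a pattern the predictor can learn.

```
#include <algorithm>
#include <vector>

long long sum_big(std::vector<int> &v) {
    // std::sort(v.begin(), v.end());  // uncommenting this typically makes
                                       // the loop below several times faster
    long long s = 0;
    for (int x : v)
        if (x >= 50)  // nearly random outcome on shuffled values in [0, 100)
            s += x;   // but a learnable pattern once the data is sorted
    return s;
}
```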
As we established in the previous section, branches that can’t be effectively predicted by the CPU are expensive, as they may cause a long pipeline stall to fetch new instructions after a branch mispredict. In this section, we discuss the means of removing branches in the first place.
#Predication
We…
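A minimal sketch of predication, assuming a simple threshold filter as the workload: the branch is replaced by arithmetic that always executes, leaving nothing to mispredict. Compilers often emit a conditional move (cmov) for the second version on their own.

```
long long sum_branchy(const int *a, int n) {
    long long s = 0;
    for (int i = 0; i < n; i++)
        if (a[i] >= 50) s += a[i];  // branch: costly if unpredictable
    return s;
}

long long sum_branchless(const int *a, int n) {
    long long s = 0;
    for (int i = 0; i < n; i++)
        s += (a[i] >= 50) * a[i];   // predication: no branch at all
    return s;
}
```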
Interleaving the stages of execution is a general idea in digital electronics, and it is applied not only in the main CPU pipeline, but also on the level of separate instructions and memory. Most execution units have their own little pipelines and can take another instruction just one or two cycles…
Optimizing for latency is usually quite different from optimizing for throughput: When optimizing data structure queries or small one-time or branchy algorithms, you need to look up the latencies of their instructions, mentally construct the execution graph of the computation, and then try to…
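One common throughput technique, sketched with my own example: splitting a single dependency chain into two independent accumulators so the CPU can overlap their additions instead of waiting on each one in turn.

```
// Two independent accumulators hide the latency of the add instruction:
// the two chains don't depend on each other, so they execute in parallel.
int sum_two_acc(const int *a, int n) {
    int s0 = 0, s1 = 0;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (n % 2) s0 += a[n - 1];  // handle a possible odd element
    return s0 + s1;
}
```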
The main benefit of learning assembly language is not the ability to write programs in it, but the understanding of what is happening during the execution of compiled code and its performance implications. There are rare cases where we really need to switch to handwritten assembly for maximal…
Before jumping straight to compiler optimizations, which is what most of this chapter is about, let’s briefly recap the “big picture” first. Skipping the boring parts, there are 4 stages of turning C programs into executables: Preprocessing expands macros, pulls included source from header files,…
The first step of getting high performance from the compiler is to ask for it, which is done with over a hundred different compiler options, attributes, and pragmas.
#Optimization Levels
There are 4 and a half main levels of optimization for speed in GCC: -O0 is the default one that does no…
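One way to ask for optimization from inside the source rather than on the command line, as the excerpt mentions; the optimize attribute below is GCC-specific.

```
// GCC lets you pin an optimization level to a single function,
// overriding whatever -O level the file is compiled with.
__attribute__((optimize("O3")))
int hot_function(const int *a, int n) {  // compiled at -O3 regardless of flags
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```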
Most compiler optimizations enabled by -O2 and -O3 are guaranteed to either improve or at least not seriously hurt performance. Those that aren’t included in -O3 are either not strictly standard-compliant, or highly circumstantial and require some additional input from the programmer to help decide…
In “safe” languages like Java and Rust, you normally have well-defined behavior for every possible operation and every possible input. There are some things that are under-defined, like the order of keys in a hash table or the growth factor of an std::vector, but these are usually some minor details…
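A small sketch of how undefined behavior enables optimization (example mine): signed overflow is undefined in C++, so the compiler may assume i + 1 > i always holds and treat the loop as simply counted.

```
int count_up(int n) {
    int c = 0;
    for (int i = 0; i < n; i++)  // the compiler assumes i never overflows,
        c++;                     // so the trip count is known to be max(n, 0)
    return c;                    // typically folded to `return n > 0 ? n : 0;`
}
```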
When compilers can infer that a certain variable does not depend on any user-provided data, they can compute its value during compile time and turn it into a constant by embedding it into the generated machine code. This optimization helps performance a lot, but it is not a part of the C++ standard,…
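C++ does, however, offer a standard-guaranteed form of compile-time computation via constexpr; a minimal sketch:

```
constexpr int factorial(int n) {
    return n <= 1 ? 1 : n * factorial(n - 1);
}

// Evaluated entirely at compile time; the binary just contains constants.
static_assert(factorial(10) == 3628800, "computed at compile time");
const int table_size = factorial(5);  // becomes the constant 120
```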
Staring at the source code or its assembly is a popular, but not the most effective way of finding performance issues. When the performance doesn’t meet your expectations, you can identify the root cause much faster using one of the special program analysis tools collectively called profilers. There…
Instrumentation is an overcomplicated term that means inserting timers and other tracking code into programs. The simplest example is using the time utility in Unix-like systems to measure the duration of execution for the whole program. More generally, we want to know which parts of the program…
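A minimal sketch of manual instrumentation with the standard <chrono> clock; work() here is a stand-in workload invented for the example.

```
#include <chrono>
#include <cstdio>

volatile long long sink;  // prevents the compiler from deleting the workload

void work() {             // a stand-in workload, purely for illustration
    long long s = 0;
    for (int i = 0; i < 10000000; i++)
        s += i;
    sink = s;
}

int main() {
    auto start = std::chrono::steady_clock::now();
    work();
    auto end = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start).count();
    printf("work() took %.3f ms\n", ms);
}
```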
Instrumentation is a rather tedious way of doing profiling, especially if you are interested in multiple small sections of the program. And even if it can be partially automated by the tooling, it still won’t help you gather some fine-grained statistics because of its inherent overhead. Another,…
The last approach to profiling (or rather a group of them) is not to gather the data by actually running the program but to analyze what should happen by simulating it with specialized tools. There are many subcategories of such profilers, differing in which aspect of computation is simulated. In…
A machine code analyzer is a program that takes a small snippet of assembly code and simulates its execution on a particular microarchitecture using information available to compilers, and outputs the latency and throughput of the whole block, as well as cycle-perfect utilization of various…
Most good software engineering practices in one way or another address the issue of making development cycles faster: you want to compile your software faster (build systems), catch bugs as soon as possible (static analysis, continuous integration), release as soon as the new version is ready…
It is not uncommon for there to be two library algorithm implementations, each maintaining its own benchmarking code, and each claiming to be faster than the other. This confuses everyone involved, especially the users, who have to somehow choose between the two. Situations like these are usually…
As we repeatedly demonstrate throughout this book, knowing darker corners of the instruction set can be very fruitful, especially in the case of CISC platforms like x86, which currently has somewhere between 1000 and 4000 distinct instructions, depending on how you count. Most of these instructions…
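One concrete instance of such a corner (my pick, not necessarily the excerpt's): population count, a single x86 instruction that would otherwise take a loop. The builtin below is GCC/Clang-specific and compiles to one popcnt with suitable -march flags.

```
#include <cstdint>
#include <cstdio>

int main() {
    uint64_t x = 0b10110111;
    printf("%d\n", __builtin_popcountll(x));  // prints 6
}
```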
The users of floating-point arithmetic deserve one of these IQ bell curve memes — because this is how the relationship between it and most people typically proceeds: Beginner programmers use it everywhere as if it were some magic unlimited-precision data type. Then they discover that 0.1 + 0.2 != 0.3…
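The classic surprise, reproduced in a few lines of standard C++: 0.1, 0.2, and 0.3 are not representable exactly in binary, so the sum picks up rounding error.

```
#include <cstdio>

int main() {
    double a = 0.1 + 0.2;
    printf("%d\n", a == 0.3);  // prints 0 (false)
    printf("%.17g\n", a);      // prints 0.30000000000000004
}
```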
When we designed our DIY floating-point type, we omitted quite a lot of important little details: How many bits do we dedicate for the mantissa and the exponent? Does a 0 sign bit mean +, or is it the other way around? How are these bits stored in memory? How do we represent 0? How exactly does rounding…
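For reference, a sketch of the answers IEEE 754 single precision gives: 1 sign bit, 8 exponent bits (biased by 127), and 23 mantissa bits, with an implicit leading 1 that is not stored. The decoding below uses memcpy to avoid undefined behavior.

```
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float f = -13.25f;  // = -1.10101 (binary) * 2^3
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));         // reinterpret the bit pattern
    unsigned sign     = bits >> 31;          // 1 means negative
    unsigned exponent = (bits >> 23) & 0xff; // biased by 127: 130 here
    unsigned mantissa = bits & 0x7fffff;     // fraction bits: 0x540000 here
    printf("sign=%u exponent=%u mantissa=0x%x\n", sign, exponent, mantissa);
}
```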
The way rounding works in hardware floats is remarkably simple: it occurs if and only if the result of the operation is not representable exactly, and by default gets rounded to the nearest representable number (in case of a tie preferring the number that ends with a zero). Consider the following…
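A small worked example of this rule, using the fact that consecutive integers above 2^24 are no longer representable in a float:

```
#include <cstdio>

int main() {
    float x = 16777216.0f;          // 2^24, exactly representable
    printf("%d\n", x + 1.0f == x);  // prints 1: 2^24 + 1 is a tie between
                                    // 2^24 and 2^24 + 2, and the tie goes to
                                    // the mantissa ending in zero, i.e. 2^24
    printf("%d\n", x + 2.0f == x);  // prints 0: 2^24 + 2 is representable
}
```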
Reaching the maximum possible precision is rarely required from a practical algorithm. In real-world data, modeling and measurement errors are usually several orders of magnitude larger than the errors that come from rounding floating-point numbers and such, and we are often perfectly happy with…