A program is CPU bound if it would go faster if the CPU were faster, i.e. it spends the majority of its time simply using the CPU (doing calculations). A program that computes new digits of π will typically be CPU-bound: it's just crunching numbers.
A program is I/O bound if it would go faster if the I/O subsystem were faster. Which exact I/O subsystem is meant can vary; I typically associate it with the disk, but of course networking, or communication in general, is common too. A program that looks through a huge file for some data might become I/O bound, since the bottleneck is then the reading of the data from disk (actually, this example is perhaps kind of old-fashioned these days with hundreds of MB/s coming in from SSDs).
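As a minimal sketch of the two cases (my illustration, not code from the answer; the file path is whatever you point it at):

#include <stdio.h>
#include <stdint.h>

/* CPU bound: the only work is arithmetic, so a faster CPU finishes sooner. */
uint64_t crunch(uint64_t n) {
    uint64_t acc = 0;
    for (uint64_t i = 0; i < n; i++)
        acc += i * i;              /* pure computation, no I/O */
    return acc;
}

/* I/O bound: counts newlines in a file, so a faster disk finishes sooner. */
long count_lines(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    long lines = 0;
    int c;
    while ((c = fgetc(f)) != EOF)  /* time is dominated by reading bytes */
        if (c == '\n')
            lines++;
    fclose(f);
    return lines;
}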
Comment: How does this tie into understanding HTTP communication on a mobile device? I've seen CPU usage spike from using java.nio operations.

Comment: "Old-fashioned"? No it's not: bandwidth isn't latency.

Comment: Computing new digits of π is 100% I/O-bound. In fact, it's generally not even possible to do the calculations in RAM, even with hundreds of GB: you need swap, and that means fast SSDs. (This makes sense: you need to generate trillions of digits, and the digits depend on one another.)

Comment: True enough for new digits of π, though it would apply to just any digits (i.e., crunching numbers, as the author stated). Given the age of the answer, you could probably snip the "new" part without changing the author's intent.

CPU Bound means the rate at which a process progresses is limited by the speed of the CPU. A task that performs calculations on a small set of numbers, for example multiplying small matrices, is likely to be CPU bound.
I/O Bound means the rate at which a process progresses is limited by the speed of the I/O subsystem. A task that processes data from disk, for example counting the number of lines in a file, is likely to be I/O bound.
Memory bound means the rate at which a process progresses is limited by the amount of memory available and the speed of that memory access. A task that processes large amounts of in-memory data, for example multiplying large matrices, is likely to be memory bound.
Cache bound means the rate at which a process progresses is limited by the amount and speed of the cache available. A task that simply processes more data than fits in the cache will be cache bound.
I/O Bound would be slower than Memory Bound would be slower than Cache Bound would be slower than CPU Bound.
The solution to being I/O bound isn't necessarily to get more Memory. In some situations, the access algorithm could be designed around the I/O, Memory or Cache limitations. See Cache Oblivious Algorithms.
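As a hedged illustration of the cache-bound regime (a standard traversal-order example, my addition rather than code from this answer, and not itself a cache-oblivious algorithm): both loops below perform identical arithmetic on the same data, but once the matrix outgrows the cache the column-major walk misses on nearly every access, so it is limited by the cache/memory system rather than by the additions.

#define N 4096
static double a[N][N];

/* Row-major traversal: consecutive accesses hit the same cache line. */
double sum_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: each access touches a different cache line,
   so the loop is limited by cache/memory, not by the additions. */
double sum_cols(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}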
Multi-threading is where it tends to matter the most
In this answer, I will investigate one important use case of distinguishing between CPU- and I/O-bound work: writing multi-threaded code.
RAM I/O bound example: Vector Sum
Consider a program that sums all the values of a single vector:
#define SIZE 1000000000
unsigned int is[SIZE];
unsigned int sum = 0;
size_t i = 0;
for (i = 0; i < SIZE; i++)
    /* Each one of those requires a RAM access! */
    sum += is[i];
Parallelizing that by splitting the array equally for each of your cores is of limited usefulness on common modern desktops.
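A minimal pthreads sketch of that split (my reconstruction of the approach, not the author's exact benchmark code):

#include <pthread.h>
#include <stddef.h>

#define SIZE 1000000000ULL
#define NTHREADS 8

unsigned int is[SIZE];  /* the same vector as in the snippet above */

struct chunk { size_t begin, end; unsigned long long sum; };

static void *partial_sum(void *arg) {
    struct chunk *c = arg;
    unsigned long long s = 0;
    for (size_t i = c->begin; i < c->end; i++)
        s += is[i];          /* every iteration still hits RAM */
    c->sum = s;
    return NULL;
}

/* Launch one thread per chunk, then add the partial sums. */
unsigned long long parallel_sum(void) {
    pthread_t tid[NTHREADS];
    struct chunk c[NTHREADS];
    size_t step = SIZE / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        c[t].begin = t * step;
        c[t].end = (t == NTHREADS - 1) ? SIZE : (t + 1) * step;
        pthread_create(&tid[t], NULL, partial_sum, &c[t]);
    }
    unsigned long long total = 0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += c[t].sum;
    }
    return total;
}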
For example, on my Ubuntu 19.04 Lenovo ThinkPad P51 laptop (CPU: Intel Core i7-7820HQ, 4 cores / 8 threads; RAM: 2x Samsung M471A2K43BB1-CRC, 2x 16GiB) I get results like this:
[plot: wall-clock time of the summation vs. number of threads]
Note that there is a lot of variance between runs, however. But I can't increase the array size much further, since I'm already at 8GiB, and I'm not in the mood for statistics across multiple runs today. This did, however, seem like a typical run after doing many manual runs.
I don't know enough computer architecture to fully explain the shape of the curve, but one thing is clear: the computation does not become 8x faster, as one might naively expect from using all 8 threads! For some reason, 2 and 3 threads were the optimum, and adding more just makes things much slower.
Compare this to CPU bound work, which actually does get 8 times faster: What do 'real', 'user' and 'sys' mean in the output of time(1)?
The reason is that all processors share a single memory bus linking them to RAM:
CPU 1 --\    Bus    +-----+
CPU 2 ---\__________| RAM |
...      ---/       +-----+
CPU N --/
so the memory bus quickly becomes the bottleneck, not the CPU.
This happens because adding two numbers takes a single CPU cycle, while memory reads take about 100 CPU cycles on 2016 hardware.
So the CPU work done per byte of input data is too small, and we call this an IO-bound process.
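A back-of-envelope check with assumed round numbers: at one 4-byte add per cycle, a single 4 GHz core already demands about 16 GB/s of input, while typical dual-channel DDR4 delivers on the order of 40 GB/s. Two to three cores are thus enough to saturate the bus, and further threads only add contention, which matches the 2-3 thread optimum observed above.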
The only way to speed up that computation further would be to speed up individual memory accesses with new memory hardware, e.g. multi-channel memory.
Upgrading to a faster CPU clock for example would not be very useful.
Other examples
Matrix multiplication, for instance, is likely CPU bound on RAM and GPU: the input contains only 2 * N**2 numbers, but N**3 multiplications need to be done, so for large N there is enough arithmetic per byte of input for the computation to dominate.
Cache usage makes a big difference to the speed of implementations. See for example this didactic GPU comparison example.
Workload Name                                     MultiCore  SingleCore
                                                   (iter/s)    (iter/s)     Scaling
-----------------------------------------------  ----------  ----------  ----------
cjpeg-rose7-preset                                   526.32      178.57        2.95
core                                                   7.39        2.16        3.42
linear_alg-mid-100x100-sp                            684.93      238.10        2.88
loops-all-mid-10k-sp                                  27.65        7.80        3.54
nnet_test                                             32.79       10.57        3.10
parser-125k                                           71.43       25.00        2.86
radix2-big-64k                                      2320.19      623.44        3.72
sha-test                                             555.56      227.27        2.44
zip-test                                             363.64      166.67        2.18

MARK RESULTS TABLE
Mark Name                                         MultiCore  SingleCore     Scaling
-----------------------------------------------  ----------  ----------  ----------
CoreMark-PRO                                       18743.79     6306.76        2.97
How to find out if you are CPU or IO bound
If execution is quick and you parametrize the number of threads, you can see it easily from time, since performance improves as the number of threads increases for CPU bound work: What do 'real', 'user' and 'sys' mean in the output of time(1)?
RAM I/O bound: harder to tell, as RAM wait time is included in CPU% measurements.
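As a hedged illustration with hypothetical numbers: a CPU bound run might print real 10.0s, user 9.9s, sys 0.1s (user + sys close to real, because the process was computing nearly the whole time), while a disk or network bound run might print real 10.0s, user 0.3s, sys 0.6s (most of the wall-clock time spent blocked waiting for I/O).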
GPUs
GPUs have an I/O bottleneck when you first transfer the input data from the regular CPU-readable RAM to the GPU.
Therefore, GPUs can only be better than CPUs for CPU bound applications.
Once the data is transferred to the GPU, however, it can operate on those bytes faster than the CPU can, because the GPU has far more (albeit simpler) arithmetic units and much higher memory bandwidth than a CPU.
Therefore the GPU can be faster than a CPU if your application is massively parallel and does enough computation per byte transferred to amortize the initial copy.
These design choices originally targeted 3D rendering, whose main steps are shown at What are shaders in OpenGL and what do we need them for?
and so we conclude that those applications are CPU-bound.
With the advent of programmable GPGPU, we can observe several GPGPU applications that serve as examples of CPU bound operations, deep learning being a prominent one.
CPython Global Interpreter Lock (GIL)
As a quick case study, I want to point out the Python Global Interpreter Lock (GIL): What is the global interpreter lock (GIL) in CPython?
This CPython implementation detail prevents multiple Python threads from efficiently running CPU-bound work in parallel. The CPython docs say:
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor . However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
Therefore, here we have an example where threading is not suitable for CPU-bound work, but is suitable for I/O-bound work.
JavaScript async and Node.js worker_threads
The situation is similar to Python's.
JavaScript is fundamentally single threaded. Not sure if this is part of the language standard or not (for Python it is not, there isn't even a language standard besides the reference CPython implementation AFAIK).
JavaScript has the async keyword, which allows execution to stall and then start executing something else. You can write stuff like:
async function myfunc(init) {
    let ret = init;
    for (let i = 0; i < 1000000; i++) {
        ret += i*i + 2*i + 3;
    }
    return ret;
}

async function doNetworkRequest(init) {
    // Some native method that gets network data.
}

const [first, second, myfunc_ret1, myfunc_ret2] = await Promise.all([
    doNetworkRequest('example.com'),
    doNetworkRequest('example2.com'),
    myfunc(1),
    myfunc(2),
]);
where await says "wait for all these async things to finish before moving on".
However, only one of the async methods can run at a time on your CPU, so the CPU intensive work of myfunc does not get sped up by this at all.
The prototypically IO bound operation of network access however can get sped up, as this would fire both network requests one after the other, and just wait for both to return while the servers/network do the work.
The fact that there is an actual keyword in the language dedicated for this, async , is telling: unsurprisingly, network requests are very important in the browser dominated context of JavaScript.
With the advent of Node.js, however, people increasingly wanted to parallelize CPU workloads as well, and they reached a solution similar to CPython's: move CPU-heavy work off the main event loop into workers that do not share ordinary JavaScript state. This is done with the worker_threads library, in which each worker runs on its own thread with an independent JavaScript engine instance.
The documentation of worker_threads once more states the difference as mentioned elsewhere in this answer:
Workers (threads) are useful for performing CPU-intensive JavaScript operations. They do not help much with I/O-intensive work. The Node.js built-in asynchronous I/O operations are more efficient than Workers can be.
On the browser there's also Web Workers, not sure how it compares to Node's implementation: What's the difference between Web Workers and Worker Threads?
CPU bound means the program is bottlenecked by the CPU, or central processing unit, while I/O bound means the program is bottlenecked by I/O, or input/output, such as reading or writing to disk, network, etc.
In general, when optimizing computer programs, one tries to seek out the bottleneck and eliminate it. Knowing that your program is CPU bound helps, so that one doesn't unnecessarily optimize something else.
[And by "bottleneck", I mean the thing that makes your program go slower than it otherwise would have.]
Another way to phrase the same idea: if speeding up the CPU doesn't make your program go faster, it may be I/O bound; if speeding up the I/O (e.g. using a faster disk) doesn't help, it may be CPU bound.
(I used "may be" because you need to take other resources into account. Memory is one example.)
When your program is waiting for I/O (i.e. a disk read/write or network read/write, etc.), the CPU is free to do other tasks even if your program is stopped. The speed of your program will mostly depend on how fast that I/O can happen, and if you want to speed it up you will need to speed up the I/O.
If your program is running lots of program instructions and not waiting for I/O, then it is said to be CPU bound. Speeding up the CPU will make the program run faster.
In either case, the key to speeding up the program might not be to speed up the hardware, but to optimize the program to reduce the amount of IO or CPU it needs, or to have it do I/O while it also does CPU intensive stuff.
I/O bound processes spend more time doing I/O than computations and have many short CPU bursts. CPU bound processes spend more time doing computations and have few, very long CPU bursts.
For example, in C#, I/O-bound work awaits an asynchronous operation directly:

private readonly HttpClient _httpClient = new HttpClient();

downloadButton.Clicked += async (o, e) =>
{
    // This line will yield control to the UI as the request
    // from the web service is happening.
    //
    // The UI thread is now free to perform other work.
    var stringData = await _httpClient.GetStringAsync(URL);
    DoSomethingWithData(stringData);
};
CPU-bound work, by contrast, is pushed to a background thread with Task.Run and then awaited:

private DamageResult CalculateDamageDone()
{
    // Code omitted:
    //
    // Does an expensive calculation and returns
    // the result of that calculation.
}

calculateButton.Clicked += async (o, e) =>
{
    // This line will yield control to the UI while CalculateDamageDone()
    // performs its work. The UI thread is free to perform other work.
    var damageResult = await Task.Run(() => CalculateDamageDone());
    DisplayDamage(damageResult);
};
I/O bound refers to a condition in which the time it takes to complete a computation is determined principally by the period spent waiting for input/output operations to be completed.
This is the opposite of a task being CPU bound. This circumstance arises when the rate at which data is requested is slower than the rate it is consumed or, in other words, more time is spent requesting data than processing it.
An application is CPU-bound when the arithmetic/logical/floating-point (A/L/FP) performance during the execution is mostly near the theoretical peak-performance of the processor (data provided by the manufacturer and determined by the characteristics of the processor: number of cores, frequency, registers, ALUs, FPUs, etc.).
The peak performance is very difficult, if not impossible, to achieve in real-world applications. Most applications access memory in different parts of the execution, and the processor is not doing A/L/FP operations during several cycles. This is called the von Neumann bottleneck, due to the distance that exists between the memory and the processor.
If you want to be near the CPU peak performance, a strategy could be to try to reuse most of the data in the cache memory in order to avoid requiring data from the main memory. An algorithm that exploits this feature is matrix-matrix multiplication (if both matrices can be stored in the cache memory). This happens because if the matrices are of size n x n, then you need to do about 2n^3 operations using only 2n^2 FP numbers of data. On the other hand, matrix addition, for example, is a less CPU-bound, or more memory-bound, application than matrix multiplication, since it requires only n^2 FLOPs on the same amount of data.
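A sketch of the two kernels being contrasted (assumed shapes and names; not necessarily the repository's exact test/gemm and test/matadd code):

#define N 512
static float A[N][N], B[N][N], C[N][N];

/* Matrix addition: n^2 FLOPs over 2*n^2 inputs -> ~0.5 FLOP per number
   loaded, so the memory system dominates. */
void matadd(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = A[i][j] + B[i][j];
}

/* Naive matrix multiplication: 2*n^3 FLOPs over the same 2*n^2 inputs
   -> ~n FLOPs per number loaded, so with good cache reuse the ALUs/FPUs
   dominate and the kernel is CPU bound. */
void matmul(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < N; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
}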
The following figure shows the FLOPs obtained with naive algorithms for matrix addition and matrix multiplication on an Intel i5-9300H:
[figure: FLOPs of naive matrix addition vs. matrix multiplication on an Intel i5-9300H]
Note that, as expected, the performance of the matrix multiplication is higher than that of the matrix addition. These results can be reproduced by running test/gemm and test/matadd, available in this repository.
I also suggest watching the video by J. Dongarra about this effect.