What do the terms "CPU bound" and "I/O bound" mean?

A program is CPU bound if it would go faster if the CPU were faster, i.e. it spends the majority of its time simply using the CPU (doing calculations). A program that computes new digits of π will typically be CPU-bound; it's just crunching numbers.

A program is I/O bound if it would go faster if the I/O subsystem were faster. Which exact I/O system is meant can vary; I typically associate it with the disk, but of course networking, or communication in general, is common too. A program that looks through a huge file for some data might become I/O bound, since the bottleneck is then the reading of the data from disk (actually, this example is perhaps kind of old-fashioned these days, with hundreds of MB/s coming in from SSDs).
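As a rough illustration of the two cases, here is a minimal sketch in C. Both function names are made up for this example. `leibniz_pi` only approximates π with a slowly converging series (it does not compute "new digits"), and `count_byte` stands in for "looking through a huge file for some data". Whether the second one is really I/O bound depends on the storage device and on whether the file is already in the OS cache.

```c
#include <assert.h>
#include <stdio.h>

/* Likely CPU-bound: sums n_terms terms of the Leibniz series
 * 4 * (1 - 1/3 + 1/5 - ...). It touches only a few local variables,
 * so its speed depends almost entirely on the CPU. */
double leibniz_pi(unsigned long n_terms) {
    double sum = 0.0;
    double sign = 1.0;
    for (unsigned long k = 0; k < n_terms; k++) {
        sum += sign / (2.0 * (double)k + 1.0);
        sign = -sign;
    }
    return 4.0 * sum;
}

/* Potentially I/O-bound: counts how often the byte `target` occurs in a
 * stream. The work per byte is one comparison, so for a large file on a
 * slow device most of the time goes into waiting for fread(). */
unsigned long count_byte(FILE *f, int target) {
    assert(f != NULL);
    unsigned char buf[1 << 16];
    unsigned long count = 0;
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == (unsigned char)target) {
                count++;
            }
        }
    }
    return count;
}
```

Calling `count_byte(f, '\n')` on an open file counts its lines, while `leibniz_pi(1000000UL)` keeps one core busy without touching the disk at all.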

answered May 15, 2009 at 13:07

How does this tie into understanding HTTP communication on a mobile device? I've seen CPU usage spike from using java.nio operations.

Commented Jun 28, 2018 at 19:49

I/O is "input/output".

Commented May 30, 2019 at 4:40

"old-fashioned": no, it's not; bandwidth isn't latency.

Commented Nov 13, 2021 at 15:10

Computing new digits of π is 100% IO-bound. In fact, it's generally not even possible to do the calculations in RAM, even with hundreds of GB -- you need swap, and that means fast SSDs. (This makes sense -- you need to generate trillions of digits, and the digits depend on one another.)

Commented Jan 14, 2023 at 13:03

@Charles true enough for new digits of pi, though it'd apply for just any digits (i.e., crunching numbers, as the author stated). I'd say given the age of the answer, you could probably snip the "new" part from the answer here without changing the author's intent.

Commented Feb 8 at 16:56

CPU Bound means the rate at which a process progresses is limited by the speed of the CPU. A task that performs calculations on a small set of numbers, for example multiplying small matrices, is likely to be CPU bound.

I/O Bound means the rate at which a process progresses is limited by the speed of the I/O subsystem. A task that processes data from disk, for example counting the number of lines in a file, is likely to be I/O bound.

Memory bound means the rate at which a process progresses is limited by the amount of memory available and the speed of access to that memory. A task that processes large amounts of in-memory data, for example multiplying large matrices, is likely to be Memory Bound.

Cache bound means the rate at which a process progresses is limited by the amount and speed of the cache available. A task that simply processes more data than fits in the cache will be cache bound.

In terms of speed, I/O Bound would be slower than Memory Bound, which would be slower than Cache Bound, which would be slower than CPU Bound.

The solution to being I/O bound isn't necessarily to get more Memory. In some situations, the access algorithm could be designed around the I/O, Memory or Cache limitations. See Cache Oblivious Algorithms.

answered May 15, 2009 at 13:26

thanks for the clear and useful summary esp. on the reference to Cache Oblivious Algorithms

Commented Aug 25, 2021 at 12:27

Multi-threading is where it tends to matter the most

In this answer, I will investigate one important use case of distinguishing between CPU-bound and IO-bound work: when writing multi-threaded code.

RAM I/O bound example: Vector Sum

Consider a program that sums all the values of a single vector:

    #define SIZE 1000000000
    unsigned int is[SIZE];
    unsigned int sum = 0;
    size_t i = 0;
    for (i = 0; i < SIZE; i++)
        /* Each one of those requires a RAM access! */
        sum += is[i];

Parallelizing that by splitting the array equally for each of your cores is of limited usefulness on common modern desktops.

For example, on my Ubuntu 19.04, Lenovo ThinkPad P51 laptop with CPU: Intel Core i7-7820HQ CPU (4 cores / 8 threads), RAM: 2x Samsung M471A2K43BB1-CRC (2x 16GiB) I get results like this:

Note that there is a lot of variance between runs, however. But I can't increase the array size much further since I'm already at 8GiB, and I'm not in the mood for statistics across multiple runs today. This did however seem like a typical run after doing many manual runs.

I don't know enough computer architecture to fully explain the shape of the curve, but one thing is clear: the computation does not become 8x faster, as one might naively expect from using all 8 of my threads! For some reason, 2 or 3 threads was the optimum, and adding more just makes things much slower.

Compare this to CPU bound work, which actually does get 8 times faster: What do 'real', 'user' and 'sys' mean in the output of time(1)?

The reason is that all processors share a single memory bus linking to RAM:

    CPU 1   --\          Bus      +-----+
    CPU 2   ---\__________________| RAM |
    ...     ---/                  +-----+
    CPU N   --/

so the memory bus quickly becomes the bottleneck, not the CPU.

This happens because adding two numbers takes a single CPU cycle, while memory reads take about 100 CPU cycles in 2016 hardware.

So the CPU work done per byte of input data is too small, and we call this an IO-bound process.

The only way to speed up that computation further would be to speed up individual memory accesses with new memory hardware, e.g. Multi-channel memory.

Upgrading to a faster CPU clock for example would not be very useful.
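The "CPU work per byte of input" argument can be put into numbers with a small sketch. The function names are made up for this illustration, and the matrix product is an assumed point of comparison: a naive N x N matrix multiplication performs N ** 3 multiplications on 2 * N ** 2 input values. The ratios ignore caches and output traffic, so treat them as a rough guide only.

```c
#include <assert.h>
#include <stddef.h>

/* Additions per input byte when summing n values of elem_size bytes each:
 * one addition per element, so the ratio does not grow with n. */
double sum_ops_per_byte(size_t n, size_t elem_size) {
    assert(n > 0 && elem_size > 0);
    return (double)n / ((double)n * (double)elem_size);
}

/* Multiplications per input byte for a naive n x n matrix product:
 * n^3 multiplications over 2 * n^2 input values, so the ratio grows with n. */
double matmul_ops_per_byte(size_t n, size_t elem_size) {
    assert(n > 0 && elem_size > 0);
    double dn = (double)n;
    return (dn * dn * dn) / (2.0 * dn * dn * (double)elem_size);
}
```

For the vector sum with 4-byte elements this gives 0.25 operations per byte regardless of size, while a 1024 x 1024 matrix product with 4-byte elements gives 128 operations per byte, which is why the second kind of workload has a much better chance of keeping several cores busy.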

Other examples

Matrix multiplication of two N x N matrices: the input contains 2 * N**2 numbers, but N ** 3 multiplications are done.

Cache usage makes a big difference to the speed of implementations. See for example this didactic GPU comparison example.
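As one CPU-side illustration of why cache usage matters (an assumed example, not the GPU comparison referred to above), the two hypothetical functions below compute the same N x N matrix product and differ only in loop order. The `ijk` order walks down a column of `b` with stride `n`, while the `ikj` order reads `b` and writes `c` row by row, which tends to be friendlier to the cache for row-major storage. No timings are claimed here; the actual difference depends on N, the cache sizes and the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Textbook i-j-k order. The inner loop reads b[k * n + j] for increasing k,
 * i.e. it jumps n elements at a time through row-major memory. */
void matmul_ijk(const double *a, const double *b, double *c, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++) {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/* i-k-j order. The inner loop reads b[k * n + j] and updates c[i * n + j]
 * for increasing j, so both are accessed contiguously. */
void matmul_ikj(const double *a, const double *b, double *c, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            c[i * n + j] = 0.0;
        }
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}
```

Both functions expect row-major arrays of `n * n` doubles and produce the same product, so they can be swapped freely when timing them against each other.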

                                                 MultiCore SingleCore
Workload Name                                     (iter/s)   (iter/s)    Scaling
----------------------------------------------- ---------- ---------- ----------
cjpeg-rose7-preset                                  526.32     178.57       2.95
core                                                  7.39       2.16       3.42
linear_alg-mid-100x100-sp                           684.93     238.10       2.88
loops-all-mid-10k-sp                                 27.65       7.80       3.54
nnet_test                                            32.79      10.57       3.10
parser-125k                                          71.43      25.00       2.86
radix2-big-64k                                     2320.19     623.44       3.72
sha-test                                            555.56     227.27       2.44
zip-test                                            363.64     166.67       2.18

MARK RESULTS TABLE

Mark Name                                        MultiCore SingleCore    Scaling
----------------------------------------------- ---------- ---------- ----------
CoreMark-PRO                                      18743.79    6306.76       2.97

How to find out if you are CPU or IO bound

Or, if execution is quick and you can parametrize the number of threads, you can easily see from the output of time whether performance improves as the number of threads increases, as it does for CPU bound work: What do 'real', 'user' and 'sys' mean in the output of time(1)?
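The same real-versus-CPU-time comparison can also be made from inside a program. The sketch below assumes a POSIX system with `clock_gettime`; `cpu_fraction`, `busy_work` and `waiting_work` are names invented for this example. A ratio near 1 suggests single-threaded CPU-bound work, and a ratio near 0 suggests the process mostly waits (for example on disk or network). Time spent stalled on RAM still counts as CPU time, so this cannot detect the RAM-bound case.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

typedef void (*workload_fn)(void);

/* Reads the given clock and returns its value in seconds. */
static double seconds(clockid_t id) {
    struct timespec ts;
    int rc = clock_gettime(id, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Returns process CPU time divided by wall-clock time for one call of work().
 * With several threads the CPU time is summed over all of them, so the
 * result can exceed 1. */
double cpu_fraction(workload_fn work) {
    double wall0 = seconds(CLOCK_MONOTONIC);
    double cpu0 = seconds(CLOCK_PROCESS_CPUTIME_ID);
    work();
    double cpu1 = seconds(CLOCK_PROCESS_CPUTIME_ID);
    double wall1 = seconds(CLOCK_MONOTONIC);
    double wall = wall1 - wall0;
    return wall > 0.0 ? (cpu1 - cpu0) / wall : 0.0;
}

/* Hypothetical stand-in for CPU-bound work: a pure arithmetic loop. */
void busy_work(void) {
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 50000000UL; i++) {
        x += i;
    }
}

/* Hypothetical stand-in for waiting on I/O: sleeps for 200 ms. */
void waiting_work(void) {
    struct timespec delay = {0, 200000000L};
    nanosleep(&delay, NULL);
}
```

On an otherwise idle machine `cpu_fraction(busy_work)` should come out close to 1 and `cpu_fraction(waiting_work)` close to 0, although the exact values vary from run to run.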

RAM-IO bound: harder to tell, as RAM wait time is included in CPU% measurements, see also:

GPUs

GPUs have an IO bottleneck when you first transfer the input data from the regular CPU readable RAM to the GPU.

Therefore, GPUs can only be better than CPUs for CPU bound applications.

Once the data is transferred to the GPU however, it can operate on those bytes faster than the CPU can, because the GPU:

Therefore the GPU can be faster than a CPU if your application:

These design choices originally targeted the application of 3D rendering, whose main steps are as shown at What are shaders in OpenGL and what do we need them for?

and so we conclude that those applications are CPU-bound.

With the advent of programmable GPGPU, we can observe several GPGPU applications that serve as examples of CPU bound operations: