Bits, Bytes and Words

Organising Bits

Now we must put all this into context. As stated previously, all information on a computer is stored as binary digits. One binary digit is called a bit. A single bit is a very small amount of data, so a much more useful storage unit is the byte, which is 8 bits. This is very useful since 2^8 = 256, and therefore a single byte can represent one of 256 different values (0 to 255 inclusive). For example:

00000000B = 0 (decimal)
11111111B = 255 (decimal)
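If you'd like to check these numbers yourself, here is a minimal Python sketch. The variable names are my own, chosen purely for illustration; it simply counts the values 8 bits can hold and converts between binary text and decimal.

```python
# How many distinct values fit in one 8-bit byte?
BITS_PER_BYTE = 8
distinct_values = 2 ** BITS_PER_BYTE

# Convert the two extremes from binary text to decimal.
lowest = int("00000000", 2)
highest = int("11111111", 2)

# Going the other way: format a decimal value as 8 binary digits.
as_binary = format(200, "08b")

print(distinct_values, lowest, highest, as_binary)
```

Any value from 0 to 255 can be passed to format() in the same way; anything larger simply needs more than 8 digits.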

(As an aside, I should point out that there are not always 8 bits to a byte. It turns out that for some computers, a byte can be a different number of bits. For this reason, when referring to a context where the number of bits to a byte may not be 8, such as networking, the term octet is preferred. An octet is always 8 bits. However, we shall ignore the exceptions here as they are fairly inconsequential, and always refer to 8 contiguous bits as a byte.)

An Example: The ASCII Character Set

A common example of the use of a single byte is defining characters. For example, consider the extended ASCII (which stands for American Standard Code for Information Interchange) character set. This character set is a table that stores 256 different characters which includes standard alphanumeric characters, punctuation symbols, and various other special characters and control-codes. MS-DOS and console-based applications in Windows use this character set to display characters to the screen. In the extended-ASCII set each character is defined by a value between 0 and 255. Since this range of numbers can be represented by just 8 bits, any given character will only occupy a single byte of memory.
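As a rough illustration of a character occupying a single byte, the Python sketch below looks up the code for the letter A and shows its 8-bit pattern. The helper name char_code is hypothetical, and I've used the plain "ascii" codec, which only covers the values 0 to 127.

```python
def char_code(ch: str) -> int:
    """Return the numeric code of a single character."""
    return ord(ch)

code = char_code("A")            # the number the character is stored as
as_bits = format(code, "08b")    # the same number as 8 binary digits
stored = "A".encode("ascii")     # the character as raw bytes

print(code, as_bits, len(stored))
```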

(As an aside, it should be noted that the original ASCII character set used only 7 bits, not 8, with the result that only 128 characters, 33 of which are control-codes, can be encoded. Adding the 8th bit extended this set so that many extra characters could be encoded, such as the symbols used to draw boxes in the good ol' DOS days! To allow for different languages, the upper half of the extended set can be 'adapted', depending on the characters you wish to use. These adaptations are called code pages. For example, code page 437 is the original IBM PC character set, used for US English. To allow the use of any language using just one character set, including vast alphabets such as the Japanese kanji, which comprise over 2000 characters, modern Windows operating systems use yet another character set called Unicode. Windows stores Unicode text as UTF-16, which uses 2 bytes - that's 16 bits - for most characters, so a single 16-bit unit can represent 65536 different values. Rarer characters take two such units, i.e. 4 bytes.)
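To make code pages and 2-byte characters a little more concrete, here is a small Python sketch. It assumes Python's built-in "cp437", "latin-1" and "utf-16-le" codecs stand in for the encodings discussed above; the variable names are only for illustration.

```python
# A box-drawing character from the DOS days (U+2500, a horizontal line),
# encoded with code page 437: it fits in one byte.
box_byte = "\u2500".encode("cp437")

# The very same byte value means something else under a different 8-bit set.
same_byte_latin1 = box_byte.decode("latin-1")

# With 16-bit Unicode text, a Latin letter and a kanji (U+6F22)
# both take 2 bytes each.
letter_utf16 = "A".encode("utf-16-le")
kanji_utf16 = "\u6f22".encode("utf-16-le")

print(box_byte, same_byte_latin1, len(letter_utf16), len(kanji_utf16))
```

The point of the second step is that a byte on its own is just a number; which character it displays as depends entirely on the code page in use.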

(You can find the full ASCII character set here!)

Other terms

Along with the byte, a few other terms you may run across include the nybble and the word. The definitions of these terms are shown below.

1 nybble = 4 bits
1 byte = 8 bits = 2 nybbles
1 word = 2 contiguous bytes *
1 double word = 2 contiguous words
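The relationships in this table can be sketched with a few bit operations in Python. The example values (0xA7, 0x12, 0x34 and so on) are arbitrary, and the 16-bit word follows the 2-byte definition used in the table above.

```python
# Split one byte into its two nybbles.
value = 0xA7                        # binary 1010 0111
high_nybble = (value >> 4) & 0xF    # top 4 bits
low_nybble = value & 0xF            # bottom 4 bits

# Join two bytes into a word, and two words into a double word.
word = (0x12 << 8) | 0x34
double_word = (0x1234 << 16) | 0x5678

print(hex(high_nybble), hex(low_nybble), hex(word), hex(double_word))
```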

Even so, the byte is a very small unit of storage. We are much more accustomed to working with units such as kilobytes (kB), megabytes (MB) and gigabytes (GB). Some confusion arises because in the decimal system, the prefix kilo represents 10^3 or 1000. However, from an IT point of view, the prefix kilo usually represents 2^10, which is 1024. Since 1024 and 1000 are fairly similar values, the difference is often ignored. Similarly, the prefix mega represents 10^6 (1 million) in decimal, but 2^20 (1048576) from a binary standpoint.

1 kilobyte (kB) = 2^10 (1024) bytes
1 megabyte (MB) = 2^20 (~1 million) bytes
1 gigabyte (GB) = 2^30 (~1000 million) bytes
1 terabyte (TB) = 2^40 (~1 million million) bytes
1 petabyte (PB) = 2^50 (~1x10^15) bytes
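It's worth seeing how far the binary values drift from their decimal namesakes as the units get bigger. The Python sketch below is only an illustration; the PREFIXES list and the gap_percent helper are my own names.

```python
# (prefix, power of two, power of ten)
PREFIXES = [
    ("kilo", 10, 3),
    ("mega", 20, 6),
    ("giga", 30, 9),
    ("tera", 40, 12),
    ("peta", 50, 15),
]

def gap_percent(binary_exp: int, decimal_exp: int) -> float:
    """How much larger 2^binary_exp is than 10^decimal_exp, in percent."""
    binary = 2 ** binary_exp
    decimal = 10 ** decimal_exp
    return (binary - decimal) / decimal * 100

for name, b, d in PREFIXES:
    print(f"{name}: 2^{b} = {2 ** b} bytes, "
          f"{gap_percent(b, d):.1f}% more than 10^{d}")
```

The gap is only about 2.4% for kilo, but it grows with every step up, so ignoring the difference becomes less harmless for the larger units.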

To further complicate matters, we now have new prefixes such as kibi, mebi, gibi, and so on. Technically, these terms are supposed to be used as binary multipliers (i.e. 2 to the power of 10, 20, 30 and so on), while the traditional decimal equivalents - kilo, mega, giga - should be used when we are actually referring to 10 to the power of 3, 6, 9, etc. However, this terminology has yet to catch on, and I'm not sure if it ever will, since the decimal prefixes are used so commonly to represent the binary values the world over.

My advice is this: if you see a decimal prefix and 'byte' intermixed, it's fairly safe to assume we're talking about binary multiples rather than decimal, i.e. 1024, not 1000. Technically, it's not right, but that's the way it is!

A Final Note on Terminology and Abbreviations...

An astute reader recently pointed out that I have erroneously been using the abbreviation Mb to represent a megabyte. I, like many others, had previously seen Mb and MB (i.e. with a capital B) as being completely interchangeable. However, strictly speaking, a lowercase b should be used when referring to bits, while an uppercase B should be used when referring to bytes. Hence, you may have 512MB of memory, but your broadband connection would have a bandwidth of 1Mb per second, i.e. 1 megabit.

I have always tried to avoid this confusion by explicitly stating Mbit when I'm referring to bits. However, we often see bandwidth written as Mbps and kbps. In these cases, the lowercase b is indeed referring to bits, not bytes.
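The factor of 8 between bits and bytes is easy to forget, so here is a hedged Python sketch of the conversion. The function names are hypothetical, and for simplicity it assumes 'mega' means the same multiple on both sides, so only the bits-to-bytes factor matters.

```python
BITS_PER_BYTE = 8

def megabits_to_megabytes(megabits: float) -> float:
    """Convert a quantity in megabits (Mb) to megabytes (MB)."""
    return megabits / BITS_PER_BYTE

def download_seconds(file_megabytes: float, link_mbps: float) -> float:
    """Idealised time to move a file over a link rated in Mbps."""
    return file_megabytes * BITS_PER_BYTE / link_mbps

print(megabits_to_megabytes(8))    # an 8 Mbps link moves 1 MB per second
print(download_seconds(512, 1))    # 512 MB over a 1 Mbps link, in seconds
```

Real connections have overheads, so treat the second figure as a best case rather than a prediction.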

It's also worth noting that many people trying to make a buck will happily exploit this ambiguity, so it's worth double-checking to see exactly what they're offering.

Advanced (for techies): Cautious Words and Memory Addressing

This section is a bit more complicated than what has gone before. Feel free to skip over this if it doesn't make much sense...

* The word. Hmmm.

It should be noted that, technically, the definition of a word is dependent on the architecture of the computer. While the smallest unit of memory typically addressable by a class of computer is the byte, it is convenient to be able to retrieve more than one byte from memory in any one fetch step. This may be two bytes, 4 bytes, or even more. It is the size of this unit that defines a word, which will clearly vary from one class of computer to another. Hence, a word may be defined as the number of bits stored/retrieved in one access of memory by the CPU.

However, in the x86 CPU world it is commonly assumed that a word represents 2 bytes, even though modern CPUs have a data bus that supports a much larger word size.

When the CPU fetches a word, then, the memory address it uses is the address of the first byte of that word. While words (in the x86 world) are two-byte chunks, this does not mean that word addresses are each two bytes apart. Addresses are normally one byte apart, so a word can begin at any byte address and words at neighbouring addresses overlap, as the following diagram shows:

Memory addressing
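For those who prefer code to diagrams, the overlap can be sketched in Python using the standard struct module. The four byte values are made up, read_word is a hypothetical helper, and the "<H" format assumes the little-endian byte order used by x86 CPUs.

```python
import struct

# Four bytes of pretend memory at addresses 0, 1, 2 and 3.
memory = bytes([0x11, 0x22, 0x33, 0x44])

def read_word(mem: bytes, address: int) -> int:
    """Read the 2-byte word whose first byte is at the given address.

    "<H" means little-endian unsigned 16-bit, so the byte at the lower
    address becomes the low half of the word.
    """
    return struct.unpack_from("<H", mem, address)[0]

# Words starting at addresses 0, 1 and 2 overlap by one byte each.
words = [read_word(memory, address) for address in range(3)]
print([hex(w) for w in words])
```

Notice that the byte at address 1 (0x22) appears in both the word at address 0 and the word at address 1, which is exactly the one-byte spacing described above.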

What's next

The last part of this section is a discussion of hexadecimal and octal.