What is ASCII?
ASCII (American Standard Code for Information Interchange) is the most common character encoding format for text data in computers and on the internet. In standard ASCII-encoded data, there are unique values for 128 alphabetic, numeric or special additional characters and control codes.
ASCII encoding is based on character encoding used for telegraph data. The American National Standards Institute first published it as a standard for computing in 1963.
Characters in ASCII encoding include upper- and lowercase letters A through Z, numerals 0 through 9 and basic punctuation symbols. It also uses some non-printing control characters that were originally intended for use with teletype printing terminals.
ASCII characters may be represented in the following ways:
- as pairs of hexadecimal digits -- base-16 numbers, represented as 0 through 9 and A through F for the decimal values of 10-15;
- as three-digit octal (base 8) numbers;
- as decimal numbers from 0 to 127; or
- as 7-bit or 8-bit binary
For example, the ASCII encoding for the lowercase letter "m" is represented in the following ways:
|Character||Hexadecimal||Octal||Decimal||Binary (7 bit)||Binary (8 bit)|
|m||0x6D||/155||109||110 1101||0110 1101|
ASCII characters were initially encoded into 7 bits and stored as 8-bit characters with the most significant bit -- usually, the left-most bit -- set to 0.
Why is ASCII important?
ASCII was the first major character encoding standard for data processing. Most modern computer systems use Unicode, also known as the Unicode Worldwide Character Standard. It's a character encoding standard that includes ASCII encodings.
The Internet Engineering Task Force (IETF) adopted ASCII as a standard for internet data when it published "ASCII format for Network Interchange" as RFC 20 in 1969. That request for comments (RFC) document standardized the use of ASCII for internet data and was accepted as a full standard in 2015.
ASCII encoding is technically obsolete, having been replaced by Unicode. Yet, ASCII characters use the same encoding as the first 128 characters of the Unicode Transformation Format 8, so ASCII text is compatible with UTF-8.
In 2003, the IETF standardized the use of UTF-8 encoding for all web content in RFC 3629.
Almost all computers now use ASCII or Unicode encoding. The exceptions are some IBM mainframes that use the proprietary 8-bit code called Extended Binary Coded Decimal Interchange Code (EBCDIC).
How does ASCII work?
ASCII offers a universally accepted and understood character set for basic data communications. It enables developers to design interfaces that both humans and computers understand. ASCII codes a string of data as ASCII characters that can be interpreted and displayed as readable plain text for people and as data for computers.
Programmers use the design of the ASCII character set to simplify certain tasks. For example, using ASCII character codes, changing a single bit easily converts text from uppercase to lowercase.
The capital letter "A" is represented by the binary value:
The lowercase letter "a" is represented by the binary value:
The difference is the third most significant bit. In decimal and hexadecimal, this corresponds to:
The difference between upper- and lowercase characters is always 32 (0x20 in hexadecimal), so converting from upper- to lowercase and back is a matter of adding or subtracting 32 from the ASCII character code.
Similarly, hexadecimal characters for the digits 0 through 9 are as follows:
Using this encoding, developers can easily convert ASCII digits to numerical values by stripping off the four most significant bits of the binary ASCII values (0011). This calculation can also be done by dropping the first hexadecimal digit or by subtracting 48 from the decimal ASCII code.
Developers can also check the most significant bit of characters in a sequence to verify that a data stream, string or file contains ASCII values. The most significant bit of basic ASCII characters will always be 0; if that bit is 1, then the character is not an ASCII-encoded character.
ASCII variants and Unicode
When it was first introduced, ASCII supported English language text only. When 8-bit computers became common during the 1970s, vendors and standards bodies began extending the ASCII character set to include 128 additional character values. Extended ASCII incorporates non-English characters, but it is still insufficient for comprehensive encoding of text in most world languages, including English. Different extended ASCII character sets are common, depending on the vendor, language and country.
Initially, other character encoding standards were adopted for other languages. In some cases, the standards were designed for other countries with different requirements. In other cases, the encodings were hardware manufacturers' proprietary designs.
Unicode defines codespaces for the implementation of character encodings for different languages. Characters can be mapped to encodings using one of the following two methods:
- Universal Coded Character Set (UCS)
Depending on the language and the mapping used, characters can be expressed in one to four 8-bit bytes (UTF-8), in two 16-bit units (UTF-16) or in a single 32-bit unit (UTF-32).
The UCS standard is maintained as an ISO (International Organization for Standardization) standard, ISO/IEC 10646. As of this writing, there are 143,859 different characters defined in version 13.0 of the Unicode standard.
ASCII advantages and disadvantages
After more than half a century of use, the advantages and disadvantages of using ASCII character encoding are well understood. That is one of the encoding format's great strengths.
- Universally accepted. ASCII character encoding is universally understood. Except for the IBM mainframes that use EBCDIC encoding, it is universally implemented in computing through the Unicode standard. Unicode character encoding replaces ASCII encoding, but it is backward-compatible with ASCII.
- Compact character encoding. Standard codes can be expressed in 7 bits. This means data that can be expressed in the standard ASCII character set requires only as many bytes to store or send as the number of characters in the data.
- Efficient for programming. The character codes for letters and numbers are well adapted to programming techniques for manipulating text and using numbers for calculations or storage as raw data.
- Limited character set. Even with extended ASCII, only 255 distinct characters can be represented. The characters in a standard character set are enough for English language communications. But even with the diacritical marks and Greek letters supported in extended ASCII, it is difficult to accommodate languages that do not use the Latin alphabet.
- Inefficient character encoding. Standard ASCII encoding is efficient for English language and numerical data. Representing characters from other alphabets requires more overhead such as escape codes.
Converting text to ASCII code in Windows
There is more than one way to display text as ASCII codes in Windows. To use the Windows PowerShell command Format-Hex to display ASCII encoding for a text file, perform the following steps.
Open the Windows PowerShell application. Click on the search box in the lower left of your Windows 10 desktop. Type PowerShell and click on the PowerShell icon to start the application.
Format-Hex command. Enter the following command to display the ASCII encoding for a file called hello.txt in the c:\Users\userID\Documents directory:
View output. ASCII encoding for the file hello.txt will be displayed as in Figure 1 below. The top of the output shows that data is displayed in 16 columns, with one character per column. A running count of characters, in hexadecimal, is displayed along the left side of the output. In this case, in the last line, there are 0x60 (or 96 in decimal) characters at the start of the last line. ASCII encoding for the file's characters are shown in a grid 16 characters wide, with encoding in two-digit hexadecimal values. The original contents of the file are displayed to the right in 16 character groupings.
Note that the original file has two spaces (ASCII 0x20) followed by a CR (carriage return, ASCII 0x0D) and LF (line feed, ASCII 0A) characters. The CR-LF combination is used in ASCII files to show the end of a line.
Other options. Format-Hex can be used with other commands for easier command-line viewing of larger files. For example, the following command is used to page through ASCII encoding of a large file:
Format-Hex .\hello-long.txt | more
The output will look similar Figure 2, and you can view output one page at a time.
ASCII code tables
As originally composed, ASCII includes 32 non-printing control codes. Many of those codes were to control the teletypewriter terminal devices used in the early days of computing to input and output data. The remaining 96 printing characters include the DEL (delete) and SPACE (space) characters, as well as all 26 letters of the alphabet, upper- and lowercase, numerals and punctuation symbols.
Extended ASCII uses 8 bits, creating a set of 127 additional characters. There is no single extended ASCII character set. These sets may differ depending on the operating system or vendor. Extended ASCII character sets typically include symbols, letters with diacritical marks, graphical markings, and mathematical symbols including some Greek letters.
Non-printing ASCII control codes
The ASCII values for 0 through 31 (binary: 000 0000 through 001 1111) are non-printing control codes. They were originally intended for controlling the flow of data and include codes that show the end or beginning of data components, codes to control or show the state of hardware used for data transmission, and codes for positioning of the cursor pointer in a data stream.
Printing ASCII characters
Although the ASCII codes for DEL and SPACE are non-printing, they are considered part of the printing characters, as they are used when sending character streams. The standard ASCII character set includes binary values from 0 (000 0000) through 127 (111 1111).
Extended ASCII characters
The standard ASCII character set is only 7 bits, and characters are represented as 8-bit bytes with the most significant bit set to 0. Modern computers almost universally use 8-bit bytes, and the extended ASCII character set includes 127 more 8-bit characters, where the most significant bit is set to 1. The extended ASCII characters includes the binary values from 128 (1000 0000) through 255 (1111 1111).
Unlike standard ASCII characters, there are multiple versions of the extended ASCII character set. Table 3 lists Microsoft's Windows-1252 character encoding of the Latin alphabet. This is the default extended ASCII character set for Windows that American and British English and other European languages use.
ASCII characters can be combined graphically to create an image. ASCII art is a common technique for creating graphical images on text-only media like a computer terminal or text-only printer. For example:
This is a simple example of ASCII art. Much more elaborate images are possible when using more lines and more characters, especially from extended ASCII character sets.
The FTP ascii command
The File Transfer Protocol (FTP) has an ascii command that is used to enable the transfer of ASCII-encoded files. When transferring files in ASCII mode in FTP, the receiving host may change the file so it will be formatted as ASCII on the destination host.
When FTP transfers files using the binary mode, those files are not changed in any way.
The future of ASCII character encoding
After more than half a century, and despite being subsumed into the Unicode standard, ASCII character encoding for universally accessible computer and network data is unrivaled. Given the need to preserve data stored over the past decades, ASCII will continue to be foundational for all computing.