Unicode: Character Encoding Standard

Unicode is a universal character encoding standard that assigns a unique code point to every character it encodes, with the goal of covering every written language. Unicode allows representation and processing of text from virtually any writing system in the world, including Latin, Hebrew, Greek, Cyrillic, Arabic, Chinese, and Japanese, as well as other modern and historic scripts, mathematical symbols, emoji, and more.

Unicode is maintained by the Unicode Consortium, which was incorporated in 1991 after work on the standard began in the late 1980s. Because web browsers and operating systems have adopted Unicode, text can be exchanged between systems without ambiguity about which characters it contains.

The most common encoding formats are UTF-8, UTF-16, and UTF-32, with UTF-8 being the most widely used. UTF stands for Unicode (or UCS) Transformation Format. UTF-8 is a variable-width encoding built from 8-bit units. It uses between 1 and 4 bytes to encode a character, which may be fewer, the same, or more bytes than UTF-16 uses for the same character. In UTF-8, every code point from 0 to 127 (U+0000 to U+007F) is stored in a single byte, identical to its ASCII encoding. Code points 128 (U+0080) and above are stored using 2 to 4 bytes. The original UTF-8 design allowed sequences of up to 6 bytes, but the format has since been restricted to 4 bytes, which is enough to reach the last Unicode code point, U+10FFFF.
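The size comparison above can be checked with a short Python sketch. The sample characters and the helper name encoded_lengths are illustrative choices, not part of any standard API; the little-endian codec names are used only so that no byte order mark is added to the counts.

```python
# One sample character per UTF-8 length class (an illustrative selection).
SAMPLES = [
    "A",           # U+0041, ASCII range
    "\u00E9",      # U+00E9, LATIN SMALL LETTER E WITH ACUTE
    "\u20AC",      # U+20AC, EURO SIGN
    "\U0001F600",  # U+1F600, an emoji outside the first 65,536 code points
]


def encoded_lengths(ch: str) -> tuple[int, int, int]:
    """Return the byte counts of ch in UTF-8, UTF-16, and UTF-32 (no BOM)."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


for ch in SAMPLES:
    u8, u16, u32 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: utf-8={u8} utf-16={u16} utf-32={u32}")
```

For these four samples, UTF-8 needs 1, 2, 3, and 4 bytes, while UTF-16 needs 2, 2, 2, and 4, so UTF-8 is smaller for "A", the same size for U+00E9 and the emoji, and larger for the euro sign.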

Key concepts

Code Points: Each character in Unicode is assigned a unique number called a code point. Code points are written as "U+" followed by four to six hexadecimal digits. For example, the letter "A" has the code point U+0041. The original version of Unicode was based on 16 bits (2 bytes), which allowed the coding of 65,536 characters. Version 2.0 of Unicode extended the range to U+10FFFF. This range is grouped in planes of 65,536 code points per plane. Unicode contains 1,114,112 code points; as of Unicode 16.0 (2024), characters are assigned to roughly 155,000 of them.

You can look up the code point of a character on this page.
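Code points can also be inspected programmatically. The following is a minimal Python sketch; the helper names to_u_notation and from_u_notation are hypothetical, while ord, chr, sys.maxunicode, and unicodedata.name are standard library features.

```python
import sys
import unicodedata


def to_u_notation(ch: str) -> str:
    """Format a single character's code point as U+XXXX (at least four hex digits)."""
    return f"U+{ord(ch):04X}"


def from_u_notation(text: str) -> str:
    """Parse a string such as 'U+0041' back into the character it names."""
    if not text.upper().startswith("U+"):
        raise ValueError(f"expected U+ notation, got {text!r}")
    return chr(int(text[2:], 16))


print(to_u_notation("A"))          # U+0041
print(from_u_notation("U+0041"))   # A
print(unicodedata.name("A"))       # LATIN CAPITAL LETTER A
print(hex(sys.maxunicode))         # 0x10ffff
print(sys.maxunicode + 1)          # 1114112 code points in total
```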

Planes: The Unicode code space is divided into 17 planes, with each plane containing 65,536 code points. The first plane, plane 0, covers code points from U+0000 to U+FFFF and is called the Basic Multilingual Plane (BMP). The majority of commonly used characters are in the BMP.

The second plane, plane 1, covers code points from U+10000 to U+1FFFF and is called the Supplementary Multilingual Plane (SMP).
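Because each plane holds exactly 65,536 (2^16) code points, the plane number is simply the code point shifted right by 16 bits. A small Python sketch of that arithmetic, using the hypothetical helper names plane_of and plane_range:

```python
def plane_of(ch: str) -> int:
    """Return the plane number (0 to 16) of a single character."""
    return ord(ch) >> 16


def plane_range(plane: int) -> tuple[int, int]:
    """Return the first and last code point of the given plane."""
    if not 0 <= plane <= 16:
        raise ValueError("Unicode has planes 0 through 16")
    start = plane << 16
    return start, start + 0xFFFF


print(plane_of("A"))            # 0, the BMP
print(plane_of("\U0001F600"))   # 1, the SMP
first, last = plane_range(1)
print(f"U+{first:04X} to U+{last:04X}")  # U+10000 to U+1FFFF
```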

Blocks: Blocks are named, contiguous ranges of code points that group characters related to a specific script (e.g., "Basic Latin", "Cyrillic", "Greek and Coptic"), a set of symbols (e.g., "Arrows", "Mathematical Operators"), or a specific purpose (e.g., "Control Pictures"). Blocks do not overlap, so every code point belongs to at most one block.
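Python's standard library does not expose block names, so the sketch below uses a small hand-copied excerpt of block ranges to show how a block lookup could work. The table BLOCKS and the function block_of are assumptions made for illustration; a complete implementation would load the full list from the Blocks.txt file of the Unicode Character Database.

```python
from bisect import bisect_right
from typing import Optional

# Illustrative excerpt only: (first code point, last code point, block name),
# sorted by first code point.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x2190, 0x21FF, "Arrows"),
    (0x2200, 0x22FF, "Mathematical Operators"),
    (0x2400, 0x243F, "Control Pictures"),
]
STARTS = [start for start, _, _ in BLOCKS]


def block_of(ch: str) -> Optional[str]:
    """Return the block name for ch, or None if it is not in the excerpt."""
    cp = ord(ch)
    # Find the last block whose start is <= cp, then check its upper bound.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0:
        _, end, name = BLOCKS[i]
        if cp <= end:
            return name
    return None


print(block_of("A"))        # Basic Latin
print(block_of("\u0416"))   # Cyrillic
print(block_of("\u2192"))   # Arrows
print(block_of("\u00E9"))   # None (its block is not in this excerpt)
```

Since blocks never overlap, a binary search over the sorted start values is enough to find the single candidate block for a code point.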