Like me, you have likely encountered the Basic Multilingual Plane (BMP), or simply Plane 0, the section of Unicode that includes most of the characters and symbols, and some of the emojis, we use every day. The BMP is often what we think of when we refer to Unicode. Most of us know there are extended planes beyond it but have likely never worked with them directly.
In this article I want to look at it through the lens of working with emojis, because that was my first encounter with extended Unicode, I think.
Let's Start with the Basic Multilingual Plane
As I mentioned earlier, the Basic Multilingual Plane (BMP) is the first and most commonly used range of Unicode characters.
Unicode is a standard that assigns a unique code point to every character and symbol in digital text. Characters within the BMP are encoded using hex values between 0000 and FFFF.
a.textContent = "\u00aa"; // ª (U+00AA, feminine ordinal indicator)
Everything after \u or U+ (the prefix that tells the programming language or the reader that this is Unicode) is a hex value. Each hex digit resolves to 4 bits in binary, e.g. F is 1111, and 4 bits × 4 digits gives 16 bits. This means every BMP character fits in a single 16-bit value; anything that does not fit falls into the supplementary planes, the first of which is the SMP.
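To make the arithmetic concrete, here is a small sketch in JavaScript. The helper name hexToBinary is my own, chosen for illustration; it converts a hex string to binary and pads the result to 4 bits per hex digit.

```javascript
// Hypothetical helper: convert a hex string to binary, 4 bits per hex digit.
function hexToBinary(hex) {
  const value = parseInt(hex, 16);
  return value.toString(2).padStart(hex.length * 4, "0");
}

console.log(hexToBinary("F"));           // 1111
console.log(hexToBinary("00aa"));        // 0000000010101010
console.log(hexToBinary("FFFF").length); // 16
```

Four hex digits always give a 16-character binary string, which is why FFFF is the largest value a single \u escape can hold.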
Without going into depth, the reason you need to keep the 16-bit number in mind is that many programming languages, JavaScript among them, handle strings as sequences of 16-bit units, an encoding better known as UTF-16.
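You can see those 16-bit units from JavaScript itself. The sketch below assumes a standard JavaScript engine; codeUnits is a hypothetical helper name, not a built-in. It lists the units that make up a string, once for a character inside the BMP and once for one outside it.

```javascript
// Hypothetical helper: list the 16-bit code units of a string as hex.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units;
}

const sun = "\u2600";        // U+2600, inside the BMP
const moon = "\uD83C\uDF19"; // U+1F319, outside the BMP

console.log(sun.length, codeUnits(sun).join(" "));   // 1 2600
console.log(moon.length, codeUnits(moon).join(" ")); // 2 D83C DF19
```

Note that length counts 16-bit units, not visible characters, so a single character outside the BMP reports a length of 2.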
What's Outside the BMP? Meet the Supplementary Multilingual Plane
Now that we know about the BMP and what UTF-16 is all about, let's talk about the Supplementary Multilingual Plane (SMP).
Characters that were added later were assigned code points starting from U+10000, and this includes emojis. They require additional bits and are stored in the supplementary planes. The Supplementary Multilingual Plane (SMP), one of these extended planes, contains characters such as:
- Additional emojis
- Rare or ancient scripts
- Special symbols for math, music, and technical fields
Let's Talk About Surrogate Pairs
The crescent moon emoji (U+1F319) is part of the SMP, not the BMP. Because of this, it cannot be represented with a single four-digit \u sequence, as it requires more than 16 bits. Instead, we use a surrogate pair in UTF-16, like this:
a.textContent = "\uD83C\uDF19 Dark Mode"; // Displays the crescent moon emoji with Dark Mode text
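Where do D83C and DF19 come from? The sketch below follows the standard UTF-16 rule: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The function name toSurrogatePair is hypothetical; in practice String.fromCodePoint does this for you.

```javascript
// Hypothetical helper: split a supplementary code point into its
// UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("Expected a code point between U+10000 and U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // 20 bits remain
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F319);
console.log(high.toString(16).toUpperCase()); // D83C
console.log(low.toString(16).toUpperCase());  // DF19

// The pair produces the same string as the built-in conversion.
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1F319)); // true
```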
How to Work with SMP Characters
When dealing with characters outside the BMP (like U+1F319 for 🌙), you need to be aware of how they're encoded. Here are some quick guidelines:
- BMP characters: U+0000 to U+FFFF (e.g., \u2600 for ☀️) are easy to use with \u escape sequences.
- SMP and other planes: U+10000 and above (e.g., U+1F319 for 🌙) require surrogate pairs in UTF-16.
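If you would rather not write the surrogate pair by hand, JavaScript offers two alternatives, assuming an engine with ES2015 support: the \u{...} escape with braces and String.fromCodePoint. A minimal sketch comparing the three spellings (the variable names are my own):

```javascript
const viaSurrogates = "\uD83C\uDF19";                   // manual surrogate pair
const viaBraceEscape = "\u{1F319}";                     // ES2015 code point escape
const viaFromCodePoint = String.fromCodePoint(0x1F319); // built-in conversion

console.log(viaSurrogates === viaBraceEscape);   // true
console.log(viaSurrogates === viaFromCodePoint); // true

// The string is still stored as two 16-bit units,
// but iterating it (here with spread) walks whole code points.
console.log(viaBraceEscape.length);      // 2
console.log([...viaBraceEscape].length); // 1
```

All three produce the same string; the braces form just lets you write the code point as it appears in the Unicode charts.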
To check if a character is in the BMP or SMP, look at the code point:
- BMP: Code points up to U+FFFF (4 hex digits).
- SMP and Beyond: Code points U+10000 and above (5 or 6 hex digits).
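The same check can be done in code. This is a sketch under the assumption that you only care about the first character of the string; describeCodePoint is a hypothetical name. It reads the code point with codePointAt and derives the plane number by dropping the lowest 16 bits.

```javascript
// Hypothetical helper: report the code point, plane, and BMP membership
// of the first character in a string.
function describeCodePoint(str) {
  const cp = str.codePointAt(0);
  return {
    codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
    plane: cp >> 16,      // 0 = BMP, 1 = SMP, higher = other planes
    isBmp: cp <= 0xFFFF,
  };
}

const sunInfo = describeCodePoint("\u2600");
const moonInfo = describeCodePoint("\uD83C\uDF19");

console.log(sunInfo.codePoint, sunInfo.plane, sunInfo.isBmp);    // U+2600 0 true
console.log(moonInfo.codePoint, moonInfo.plane, moonInfo.isBmp); // U+1F319 1 false
```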
Summary
- The BMP is Unicode's most commonly used range, covering characters from U+0000 to U+FFFF.
- The SMP is an extended range for less common characters, starting at U+10000.
- BMP characters are simple to work with using \u sequences, while SMP characters need special handling with surrogate pairs.
Understanding these two planes can help you manage emojis and special characters smoothly in your projects!