Convert text to various Unicode formats or restore Unicode to text.
Unicode is a universal character encoding standard that assigns a unique code point to every character. It allows computers to consistently represent and manipulate text in most of the world's writing systems.
Unicode supports over 140,000 characters covering more than 150 modern and historic scripts, as well as symbols, emoji, and other notations. Unicode escape sequences represent these characters using only plain ASCII text, which makes them safe to embed in source code and data formats.
Each code point is a numerical value, typically written in hexadecimal and prefixed with 'U+', such as U+0041 for the letter 'A'.
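The U+ notation can be produced with Python's built-in ord() and chr(). The following is a minimal sketch; the helper names to_code_points and from_code_points are made up for illustration and are not part of any library.

```python
def to_code_points(text):
    """Return the U+XXXX label for each character in text."""
    # ord() gives the code point as an integer; :04X pads to at least 4 hex digits.
    return [f"U+{ord(ch):04X}" for ch in text]


def from_code_points(labels):
    """Rebuild text from labels such as 'U+0041' (assumes every label starts with 'U+')."""
    return "".join(chr(int(label[2:], 16)) for label in labels)


print(to_code_points("A"))                     # ['U+0041']
print(from_code_points(["U+0048", "U+0069"]))  # Hi
```

Code points above U+FFFF simply come out with five or six hex digits, for example U+1F600.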
UTF-8: Variable-length encoding (1-4 bytes per character). Efficient for ASCII text, as ASCII characters use just 1 byte.
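UTF-16: Uses 2 bytes for characters in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) and 4 bytes (a surrogate pair) for all other characters. Common in many programming environments (JavaScript, Java, .NET).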
UTF-32: Fixed 4 bytes per character. Simpler to process but uses more storage.
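To see the size differences between the three encodings, one can encode a few sample characters and count the bytes. This is only a sketch: encoded_sizes is a hypothetical helper, and the big-endian codec names are used so that Python does not prepend a byte order mark, which would inflate the counts.

```python
def encoded_sizes(ch):
    """Return the number of bytes ch occupies in each encoding (no BOM)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


# ASCII letter, accented Latin letter, euro sign, emoji outside the BMP
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The ASCII letter takes 1 byte in UTF-8 and the emoji takes 4 bytes in every encoding, while UTF-32 stays at 4 bytes throughout.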
When decoding, the software reads the encoded bytes and maps them back to the corresponding Unicode code points, then displays the appropriate characters.
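The decoding step can be sketched in a few lines. The function decode_bytes below is an illustrative helper, not an existing API, and it assumes the input arrives as a string of hex bytes. Invalid byte sequences are replaced with U+FFFD instead of raising an error, which is one possible policy among several.

```python
def decode_bytes(hex_bytes, encoding="utf-8"):
    """Turn a hex string such as '48 69' back into text."""
    raw = bytes.fromhex(hex_bytes)  # whitespace between byte pairs is ignored
    # errors="replace" substitutes U+FFFD for bytes that are invalid in the encoding
    return raw.decode(encoding, errors="replace")


print(decode_bytes("48 69"))                     # Hi
print(decode_bytes("00 48 00 69", "utf-16-be"))  # Hi
```

Decoding only works when the encoding is known or guessed correctly: the same bytes read with the wrong encoding produce different characters.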
Different programming languages use different formats for Unicode escape sequences:
JavaScript: \u0041 for BMP characters, \u{1F600} for others
Python: \u0041 for BMP, \U0001F600 for others
C#: \u0041 for BMP, \U0001F600 or a surrogate pair (\uD83D\uDE00) for others
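The escape formats listed above differ only in how they handle code points beyond U+FFFF. The sketch below generates each style from Python; the three function names are hypothetical, and every character is escaped (including plain ASCII) to keep the logic short.

```python
def js_escape(text):
    """JavaScript style: \\uXXXX for BMP, \\u{X...} for everything else."""
    return "".join(
        f"\\u{ord(ch):04X}" if ord(ch) <= 0xFFFF else f"\\u{{{ord(ch):X}}}"
        for ch in text
    )


def python_escape(text):
    """Python style: \\uXXXX for BMP, \\UXXXXXXXX for everything else."""
    return "".join(
        f"\\u{ord(ch):04X}" if ord(ch) <= 0xFFFF else f"\\U{ord(ch):08X}"
        for ch in text
    )


def surrogate_pair_escape(text):
    """UTF-16 style: non-BMP characters become a high and a low surrogate."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")
        else:
            offset = cp - 0x10000             # 20 bits remain to be split
            high = 0xD800 + (offset >> 10)    # top 10 bits
            low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
            out.append(f"\\u{high:04X}\\u{low:04X}")
    return "".join(out)


emoji = "\U0001F600"
print(js_escape(emoji))              # \u{1F600}
print(python_escape(emoji))          # \U0001F600
print(surrogate_pair_escape(emoji))  # \uD83D\uDE00
```

For BMP characters all three functions return the same \uXXXX form.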
Let's examine how different characters are represented in Unicode:
Text: Hi
Code points: U+0048 U+0069
UTF-8 bytes (hex): 48 69
UTF-16 bytes (hex, big-endian): 00 48 00 69
Escape sequence: \u0048\u0069
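The values above correspond to the two-character text "Hi" and can be reproduced, then reversed, with a short script. This is a sketch under assumptions: describe, unescape, and hex_bytes are illustrative names, and unescape relies on Python's unicode_escape codec, so it expects ASCII-only input using Python-style escapes.

```python
def hex_bytes(data):
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def describe(text):
    """Collect several Unicode representations of text."""
    return {
        "code_points": " ".join(f"U+{ord(ch):04X}" for ch in text),
        "utf8": hex_bytes(text.encode("utf-8")),
        "utf16_be": hex_bytes(text.encode("utf-16-be")),
        "escaped": "".join(
            f"\\u{ord(ch):04X}" if ord(ch) <= 0xFFFF else f"\\U{ord(ch):08X}"
            for ch in text
        ),
    }


def unescape(escaped):
    """Restore text from \\uXXXX / \\UXXXXXXXX escapes (ASCII input only)."""
    return escaped.encode("ascii").decode("unicode_escape")


info = describe("Hi")
for name, value in info.items():
    print(f"{name}: {value}")

print(unescape(info["escaped"]))  # Hi
```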