UTF-8 on Jefferson Oliveira

The Evolution of Digital Writing - From ASCII to the Wonders of UTF-8

Thu, 07 Aug 2025 10:29:04 -0300

Have you ever wondered how computers turn letters, numbers, and emojis into zeros and ones they can understand? Just as we humans assign meanings to the letters of the alphabet, the computer does the same. Let’s explore two of the most popular text encoding standards.

ASCII

Developed in the 1960s, ASCII (American Standard Code for Information Interchange) has a very simple premise: 7 bits give you 128 numbers (0 to 127). The first 32 are reserved for control commands such as line feed and tab, and the rest (apart from 127, the delete command) is filled with letters, digits, and some punctuation marks.

The people who designed the standard ordered the characters so that the sequence itself helps with decoding.

For example, the character “0” is number 48, which in 7 bits is 011 0000. Likewise:

1 → 011 0001

2 → 011 0010

3 → 011 0011

Notice that the last 4 bits count up in sequence. So, to find the integer an ASCII digit stands for, you just subtract 011 0000 (decimal: 48).

The letters work the same way: “A” is 65 → 100 0001 and “a” is 97 → 110 0001, so the last 5 bits give the letter’s position in the alphabet. With this, it was possible to encode all the letters of the English alphabet 🇬🇧.

Interactive Observable notebook widget — view it on the blog.

… meanwhile in the rest of the world …

Scene from the animated series South Park

As you might imagine, as technology advanced and computers gained capacity, each country used the extra room to encode its own characters. Japan, for example, did not even use plain ASCII. Other encodings, like Shift JIS, used multiple bytes per character, and all of this created a gigantic incompatibility.

Fun fact: Japanese has the word mojibake (文字化け), roughly “garbled characters”. It describes what happened when text was decoded with the wrong encoding, a common problem given the several competing encodings for the Japanese scripts, on top of the Latin ones.

However, even with all this incompatibility during the 1980s and 1990s, what were the chances of a London company having to send documents to Japan all the time? Back then, the solution was simple: print it and send it by fax!

Scene from the animated series The Simpsons

Then the internet arrived, and what was bad got even worse… Now documents were being sent across the world constantly. Over time, this led to the formation of the:

Unicode Consortium

And in what could be called a miracle of common sense, over the last few decades it built a standard with 154,998 characters (as of Unicode 16.0) that covers practically every writing system you can imagine: Arabic, Japanese, Cyrillic, Chinese, Korean, and even Egyptian hieroglyphs.

Put simply, they took more than a hundred thousand numbers and assigned each one to a character: the number 35307 represents the Japanese character 觫, the number 963 represents σ, and so on.

UTF-8

Perfect, now we have more than a hundred thousand numbers to represent every possible character, but how are we going to do this in binary?

The numbers go up to 1,114,111 (U+10FFFF), which needs 21 bits, so the obvious fixed-size choice is 32 bits per character. That creates a problem for the English alphabet: Unicode is compatible with ASCII, meaning “A” is still 65 and “a” is still 97, but written as 32-bit binary those characters now take 4x more space than a single byte.

A → 00000000 00000000 00000000 01000001
a → 00000000 00000000 00000000 01100001

Counting above, there are 25 consecutive zeros in front of every basic Latin letter, and that’s just the first of our problems. The second is that some old systems interpret a byte of 8 zeros (NUL) as the end of a string, the famous \0 in C.

So UTF-8 comes in. The first rule: if the character’s number is 127 or lower, you represent it exactly as in ASCII, in a single byte that starts with 0.

So the first problem is solved: “A” is still 65 and fits in 8 bits: 01000001.

And for numbers greater than 127? For those, you split the binary across 2 bytes.

Byte 1	Byte 2
`110xxxxx`	`10xxxxxx`

Byte 1 starts with the header 110, which means this character is spread over 2 bytes. Byte 2 starts with the continuation header 10. You fill the remaining bits (the x positions, 11 in total) with the number you want to represent.

To decode, just remove the headers and join the remaining bits; the resulting number is the Unicode character. With 11 bits this works up to 2047. Beyond that? No problem! With the header 1110 plus 2 continuation bytes, you have 16 bits.

Byte 1	Byte 2	Byte 3
`1110xxxx`	`10xxxxxx`	`10xxxxxx`

Want to go further? That’s fine! The header 11110 plus 3 continuation bytes gives 21 bits, enough for every Unicode character. The original design went as far as the header 1111110x plus 5 continuation bytes, but the current standard (RFC 3629) stops at 4 bytes in total.

Encoding UTF-8

Interactive Observable notebook widget — view it on the blog.

It’s amazing how much this standard manages to deliver:

It’s compatible with the systems that came before;
It doesn’t waste space on ASCII text;
And a byte of 8 zeros never appears inside any character, except for the NUL character itself.

Another reason it became the world standard is that it is self-synchronizing: if you land in the middle of a text and don’t know where you are, you just skip ahead to the next byte that doesn’t start with 10, which is the header of the next character. No index needed.

UTF-8 has been the dominant encoding on the internet for years now, and the fact that the average Japanese person no longer needs to worry about mojibake is thanks to this brilliant method of encoding text.


The Evolution of Digital Writing - From ASCII to the Wonders of UTF-8

Thu, 07 Aug 2025 10:29:04 -0200

Hey Y’all! This is a translation of my blog post originally written in Portuguese.

Have you ever wondered how computers turn letters, numbers, and emojis into zeros and ones they can understand? Just as we humans assign meanings to the letters of the alphabet, the computer does the same. Let’s explore two of the most popular text encoding standards.

ASCII

Developed in the 1960s, ASCII (American Standard Code for Information Interchange) has a very simple premise: 7 bits give you 128 numbers (0 to 127). The first 32 are reserved for control commands such as line feed and tab, and the rest (apart from 127, the delete command) is filled with letters, digits, and some punctuation marks.

The people who designed the standard ordered the characters so that the sequence itself helps with decoding.

For example, the character “0” is number 48, which in 7 bits is 011 0000. Likewise:

1 → 011 0001

2 → 011 0010

3 → 011 0011

Notice that the last 4 bits count up in sequence. So, to find the integer an ASCII digit stands for, you just subtract 011 0000 (decimal: 48).
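Here is a minimal Python sketch of that subtraction. The helper name `digit_value` is made up for this example.

```python
def digit_value(ch: str) -> int:
    """Return the integer that a single ASCII digit character stands for."""
    code = ord(ch)  # the character's number, e.g. 48 for "0"
    if not 48 <= code <= 57:
        raise ValueError(f"{ch!r} is not an ASCII digit")
    return code - 48  # subtract the number of "0" (011 0000)


for ch in "0123":
    # character, its 7-bit pattern, and the integer recovered from it
    print(ch, format(ord(ch), "07b"), digit_value(ch))
```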

The letters work the same way: “A” is 65 → 100 0001 and “a” is 97 → 110 0001, so the last 5 bits give the letter’s position in the alphabet. With this, it was possible to encode all the letters of the English alphabet 🇬🇧.
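A small sketch to see those patterns side by side. It also shows that an uppercase letter and its lowercase twin differ in a single bit, the one worth 32. `flip_case` is a hypothetical helper and only makes sense for the ASCII letters A-Z and a-z.

```python
def flip_case(ch: str) -> str:
    """Toggle the case of one ASCII letter by flipping the bit worth 32."""
    return chr(ord(ch) ^ 0b0100000)


for upper in "AZ":
    lower = flip_case(upper)
    print(upper, ord(upper), format(ord(upper), "07b"))
    print(lower, ord(lower), format(ord(lower), "07b"))
```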

Interactive Observable notebook widget — view it on the blog.

… meanwhile in the rest of the world …

Scene from the animated series South Park

As you might imagine, as technology advanced and computers gained capacity, each country used the extra room to encode its own characters. Japan, for example, did not even use plain ASCII. Other encodings, like Shift JIS, used multiple bytes per character, and all of this created a gigantic incompatibility.

Fun fact: Japanese has the word mojibake (文字化け), roughly “garbled characters”. It describes what happened when text was decoded with the wrong encoding, a common problem given the several competing encodings for the Japanese scripts, on top of the Latin ones.
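Mojibake is easy to reproduce. The sketch below assumes a sender that stores text as Shift JIS and a reader that wrongly treats the bytes as Latin-1; both codecs ship with Python, but the scenario itself is only an illustration.

```python
text = "文字化け"

raw = text.encode("shift_jis")   # bytes as a Shift JIS system would store them
garbled = raw.decode("latin-1")  # the same bytes read with a Western encoding

print(len(raw), "bytes")
print(repr(garbled))             # nothing like the original text

# No information was lost: reading the bytes with the right encoding restores the text.
restored = garbled.encode("latin-1").decode("shift_jis")
print(restored)
```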

However, even with all this incompatibility during the 1980s and 1990s, what were the chances of a London company having to send documents to Japan all the time? Back then, the solution was simple: print it and send it by fax!

Scene from the animated series The Simpsons

Then the internet arrived, and what was bad got even worse… Now documents were being sent across the world constantly. Over time, this led to the formation of the:

Unicode Consortium

And in what could be called a miracle of common sense, over the last few decades it built a standard with 154,998 characters (as of Unicode 16.0) that covers practically every writing system you can imagine: Arabic, Japanese, Cyrillic, Chinese, Korean, and even Egyptian hieroglyphs.

Put simply, they took more than a hundred thousand numbers and assigned each one to a character: the number 35307 represents the Japanese character 觫, the number 963 represents σ, and so on.
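In Python, the built-ins `ord` and `chr` convert between a character and its Unicode number, so the two examples above can be checked directly. The `U+` hexadecimal notation in the last line is how Unicode numbers are usually written.

```python
print(ord("σ"))             # the number assigned to σ
print(chr(963))             # and back to the character
print(chr(35307))           # the character the text gives for 35307
print(f"U+{35307:04X}")     # the same number in hexadecimal
```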

UTF-8

Perfect, now we have more than a hundred thousand numbers to represent every possible character, but how are we going to do this in binary?

The numbers go up to 1,114,111 (U+10FFFF), which needs 21 bits, so the obvious fixed-size choice is 32 bits per character. That creates a problem for the English alphabet: Unicode is compatible with ASCII, meaning “A” is still 65 and “a” is still 97, but written as 32-bit binary those characters now take 4x more space than a single byte.

A → 00000000 00000000 00000000 01000001
a → 00000000 00000000 00000000 01100001

Counting above, there are 25 consecutive zeros in front of every basic Latin letter, and that’s just the first of our problems. The second is that some old systems interpret a byte of 8 zeros (NUL) as the end of a string, the famous \0 in C.
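Both problems show up in a few lines of Python. This sketch uses the `utf-32-be` codec as a stand-in for "32 bits per character", and `split` on the zero byte to imitate, very roughly, how a C-style reader would stop at the first \0.

```python
text = "Aa"
fixed = text.encode("utf-32-be")  # 4 bytes per character, no byte-order mark

print(fixed.hex(" "))                          # 00 00 00 41 00 00 00 61
print(len(text.encode("ascii")), len(fixed))   # 2 8

# Problem 1: leading zeros in front of every basic Latin letter.
bits = format(ord("A"), "032b")
leading_zeros = len(bits) - len(bits.lstrip("0"))
print(leading_zeros)                           # 25

# Problem 2: a reader that stops at the first zero byte sees an empty string.
c_view = fixed.split(b"\x00", 1)[0]
print(c_view)                                  # b''
```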

So UTF-8 comes in. The first rule: if the character’s number is 127 or lower, you represent it exactly as in ASCII, in a single byte that starts with 0.

So the first problem is solved: “A” is still 65 and fits in 8 bits: 01000001.
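A quick check of that compatibility, using Python's built-in codecs. The sample string is arbitrary.

```python
sample = "Hello, ASCII!"
as_ascii = sample.encode("ascii")
as_utf8 = sample.encode("utf-8")

print(as_ascii == as_utf8)                     # True: byte for byte the same
print(format("A".encode("utf-8")[0], "08b"))   # 01000001
```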

And for numbers greater than 127? For those, you split the binary across 2 bytes.

Byte 1	Byte 2
`110xxxxx`	`10xxxxxx`

Byte 1 starts with the header 110, which means this character is spread over 2 bytes. Byte 2 starts with the continuation header 10. You fill the remaining bits (the x positions, 11 in total) with the number you want to represent.
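As a sketch, here is the 2-byte layout built by hand for σ (963). `encode_two_bytes` is a name invented for this example; it only covers numbers that fit in the 11 free bits, that is, 128 to 2047.

```python
def encode_two_bytes(code_point: int) -> bytes:
    """Pack a number between 128 and 2047 into the 110xxxxx 10xxxxxx layout."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("needs a number between 128 and 2047")
    byte1 = 0b11000000 | (code_point >> 6)          # header 110 + top 5 bits
    byte2 = 0b10000000 | (code_point & 0b00111111)  # header 10 + low 6 bits
    return bytes([byte1, byte2])


encoded = encode_two_bytes(963)
print(" ".join(format(b, "08b") for b in encoded))  # 11001111 10000011
print(encoded == "σ".encode("utf-8"))               # True
```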

To decode, just remove the headers and join the remaining bits; the resulting number is the Unicode character. With 11 bits this works up to 2047. Beyond that? No problem! With the header 1110 plus 2 continuation bytes, you have 16 bits.

Byte 1	Byte 2	Byte 3
`1110xxxx`	`10xxxxxx`	`10xxxxxx`
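Going the other way, removing the headers and joining the bits, could look like the sketch below. `decode_one` is a hypothetical helper that handles only the 1, 2 and 3-byte layouts shown so far and does not do the extra validation a real decoder needs.

```python
def decode_one(data: bytes) -> int:
    """Decode the first UTF-8 character in data into its Unicode number."""
    first = data[0]
    if first < 0b10000000:        # 0xxxxxxx: plain ASCII, one byte
        return first
    if first >> 5 == 0b110:       # 110xxxxx: 2 bytes, 5 payload bits here
        length, value = 2, first & 0b00011111
    elif first >> 4 == 0b1110:    # 1110xxxx: 3 bytes, 4 payload bits here
        length, value = 3, first & 0b00001111
    else:
        raise ValueError("header not covered by this sketch")
    if len(data) < length:
        raise ValueError("character is cut off")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (byte & 0b00111111)  # append 6 more bits
    return value


print(decode_one(b"A"))                         # 65
print(decode_one(b"\xcf\x83"))                  # 963
print(decode_one(chr(35307).encode("utf-8")))   # 35307
```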

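Want to go further? That’s fine! The header 11110 plus 3 continuation bytes gives 21 bits, enough for every Unicode character. The original design went as far as the header 1111110x plus 5 continuation bytes, but the current standard (RFC 3629) stops at 4 bytes in total.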
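Putting the layouts together, a complete encoder fits in a few lines. This is a sketch under the assumption of today's 4-byte limit (RFC 3629), so the longest header it produces is 11110xxx; `encode_utf8` is an illustrative name, and in practice you would just call `str.encode("utf-8")`.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode number as 1 to 4 UTF-8 bytes."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode number")
    if code_point < 0x80:
        return bytes([code_point])       # 0xxxxxxx, same as ASCII
    if code_point < 0x800:
        count, header = 1, 0b11000000    # 110xxxxx + 1 continuation byte
    elif code_point < 0x10000:
        count, header = 2, 0b11100000    # 1110xxxx + 2 continuation bytes
    else:
        count, header = 3, 0b11110000    # 11110xxx + 3 continuation bytes
    tail = []
    for _ in range(count):
        # take the lowest 6 bits and put the 10 header in front of them
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # whatever bits are left go into the first byte, after its header
    return bytes([header | code_point] + tail[::-1])


for ch in "Aσ" + chr(35307) + "😀":
    print(ch, encode_utf8(ord(ch)).hex(" "))
```

Comparing the result with Python's own encoder is a cheap way to convince yourself the bit shuffling is right.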

Encoding UTF-8

Interactive Observable notebook widget — view it on the blog.

It’s amazing how much this standard manages to deliver:

It’s compatible with the systems that came before;
It doesn’t waste space on ASCII text;
And a byte of 8 zeros never appears inside any character, except for the NUL character itself.
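The zero-byte claim can be checked with a short sketch. The sample text is arbitrary; the point is that every byte of a multi-byte character has its top bit set, so none of them can be 00000000.

```python
text = "Aσ" + chr(35307) + "😀"
encoded = text.encode("utf-8")
print(encoded.hex(" "))

# No zero byte anywhere in the encoded text.
has_zero_byte = 0 in encoded
print(has_zero_byte)            # False

# Every byte of a multi-byte character is 128 or higher.
multi_byte_ok = all(
    byte >= 0x80
    for ch in text
    if len(ch.encode("utf-8")) > 1
    for byte in ch.encode("utf-8")
)
print(multi_byte_ok)            # True

# The only way to get a zero byte is to encode the NUL character itself.
print("\0".encode("utf-8"))     # b'\x00'
```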

Another reason it became the world standard is that it is self-synchronizing: if you land in the middle of a text and don’t know where you are, you just skip ahead to the next byte that doesn’t start with 10, which is the header of the next character. No index needed.
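A minimal sketch of that resynchronization, assuming the input is valid UTF-8. `is_continuation` and `next_char_start` are names invented for this example.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return byte >> 6 == 0b10


def next_char_start(data: bytes, index: int) -> int:
    """Move forward from index to the first byte that begins a character."""
    while index < len(data) and is_continuation(data[index]):
        index += 1
    return index


# Bytes: 61 | E8 A7 AB | CF 83
data = ("a" + chr(35307) + "σ").encode("utf-8")

start = next_char_start(data, 2)  # index 2 is in the middle of the 3-byte character
print(start, data[start:].decode("utf-8"))  # 4 σ
```

Walking backwards works the same way: step left while the byte is a continuation byte, and you arrive at the header of the character you were in.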

UTF-8 has been the dominant encoding on the internet for years now, and the fact that the average Japanese person no longer needs to worry about mojibake is thanks to this brilliant method of encoding text.

