Incomprehensible characters instead of text in the browser
Hello, dear readers, admirers and other good people!
Have you ever received an e-mail written in some unreadable language, or opened a website and seen solid gibberish instead of normal letters? If so, this article is for you: we will talk about page encoding, what kinds of encodings exist, why garbled text appears and how to avoid incomprehensible hieroglyphs in the future.
So today's article is not a light software review but a fairly technical one, so get ready: we will touch on some harsh realities.
Let's go.
What is text encoding and what is it used for?
I should start by saying that this article might never have existed, because the author's computer life had been going along quite calmly. But one fine day, browsing the Internet from a computer that was not my own, I ran into a strange phenomenon on some sites. Instead of the familiar Russian alphabet and clean, readable text, I saw nonsense in the form of an incomprehensible sequence of symbols. It looked something like this (see image).
At first I thought that my beloved Firefox had overheated and needed an ambulance, but then I began to understand that the problem was most likely on the side of the site and lay in an incorrectly configured encoding. That turned out to be the case, and after a bit of fiddling the problem was resolved. Today's material is the result of those adventures. So let's look into the details.
All information presented in digital form on the global web can be viewed from two sides: from the user's side (neat, readable text on the monitor screen) and from the search engine's side (program code consisting of various tags and meta tags, symbol tables, etc.).
If you are at least a little familiar with hypertext markup language (HTML), you know that search engines (Google, Yandex) see a site not as ordinary text but as a structured document made up of sequences of tags. To make this clearer, let's look at our favorite site, “Notes from Sys.Admin”, not through the eyes of an ordinary user but through the “eyes” of a search engine. To do this, press Ctrl+U (in Firefox and Chrome) and you will see the following picture (see image):
What we have before us is the machine version of the website. It is in this unpresentable form that the site is served to search engines, and it is in this form that they consume it. If we simply pasted in articles from Notepad or Word as plain text, the machines would choke on them or refuse them altogether. So, we have the main page of the project in HTML form. Pay attention to the line that says UTF-8: this is the notorious encoding of the page text. It tells the browser how to turn the bytes of the page into characters, and thanks to it we see normal text in the browser.
Now let's figure out why we sometimes see garbled characters (“krakozyabry”) on the screen. It's very simple: the text is being opened in the wrong encoding. In everyday terms, imagine you were sent to the store for milk and came back with bread: it is also edible, but it is a completely different product.
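To make the “wrong encoding” idea concrete, here is a minimal Python sketch. The sample word and the choice of windows-1251 as the wrong table are my own assumptions for illustration, not something taken from a particular site.

```python
# Minimal demonstration of how garbled text ("krakozyabry") arises.
original = "Привет"

# The site stores and sends the text as UTF-8 bytes...
raw_bytes = original.encode("utf-8")

# ...but the reader decodes those bytes with the wrong table (windows-1251).
garbled = raw_bytes.decode("cp1251")

# Undoing the wrong step and decoding correctly restores the text.
restored = garbled.encode("cp1251").decode("utf-8")

print(garbled)   # РџСЂРёРІРµС‚
print(restored)  # Привет
```

Nothing is lost in this case: the bytes are the same all along, only the table used to read them changes.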
Now let's look at the theory, starting with a few definitions.
- Encoding (or “charset”) – a correspondence between a set of characters and a set of numeric values. It is needed to store text and send it over the Internet, i.e. to convert text into bits of data;
- Code page (“codepage”) – a 1-byte (8-bit) encoding;
- The number of values 1 byte can take is 256 (two to the eighth power).
The correspondence between a character and its code is specified by special code tables, in which each character is assigned its own numeric code. There are quite a lot of such tables, and the same character can have different numeric codes in different tables.
Encodings differ in the number of bytes they use per character and in the set of characters they can represent.
Note:
Decoding is the operation that converts a character code back into the image of the character. As a result of this operation, the information is displayed on the user's monitor screen.
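The definitions above can be checked in a few lines of Python. This is only a sketch: the letter and the list of code tables are picked by me for illustration.

```python
# The same letter gets a different numeric code in each code table.
letter = "Ж"

for encoding in ("cp1251", "koi8_r", "cp866", "utf-8"):
    encoded = letter.encode(encoding)
    # Show the bytes as hex so the numeric codes are visible.
    print(f"{encoding:8} -> {encoded.hex(' ')}")

# Decoding is the reverse lookup: byte code -> character.
byte_code = letter.encode("cp1251")
right_table = byte_code.decode("cp1251")   # the right table gives the letter back
wrong_table = byte_code.decode("koi8_r")   # the wrong table gives a different character
print(right_table, wrong_table)
```

The three 8-bit code pages each use a single byte for the letter, while UTF-8 uses two.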
That covers the definitions; now let's find out what kinds of encodings there are.
Types of text encodings
There are plenty of them.
- ASCII
One of the most “ancient” is the American code table ASCII (pronounced “ask-ee”), adopted by the American National Standards Institute. It uses 7 bits, so its 128 values hold the English alphabet (lower and upper case), digits, punctuation marks and control characters. It suited English-speaking users but was not universal.
- Cyrillic
A domestic version of the encoding, which uses the second half of the code table: values from 128 to 255. Designed for a Russian-speaking audience.
- MS Windows family encodings: Windows 1250-1258.
8-bit encodings that appeared with the spread of the most popular operating system, Windows. The numbers from 1250 to 1258 indicate the languages each one is tailored for, for example 1250 for the languages of Central Europe and 1251 for the Cyrillic alphabet.
- Information exchange code, 8 bits – KOI8
KOI8-R, KOI8-U, KOI-7 – standards for the Cyrillic alphabet (KOI8-R for Russian, KOI8-U for Ukrainian), used mainly in Unix-like operating systems.
- Unicode
A universal character encoding standard that can describe the characters of almost all written languages. Characters are designated “U+xxxx” (xxxx are hexadecimal digits). The most common encodings belong to the UTF (Unicode Transformation Format) family: UTF-8, UTF-16 and UTF-32.
Currently, as they say, UTF-8 “rules”: it provides the best compatibility with older systems that used 8-bit characters, because plain ASCII text is stored in it unchanged. The majority of sites on the Internet use UTF-8, and it is this encoding that is universal (it supports Cyrillic, Latin and other alphabets).
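A small Python sketch of the differences inside the UTF family and of the ASCII compatibility mentioned above. The sample characters are my own choice.

```python
samples = ["A", "Я", "€"]

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")   # big-endian, no byte order mark
    utf32 = ch.encode("utf-32-be")
    print(f"{ch}: utf-8={len(utf8)} bytes, "
          f"utf-16={len(utf16)} bytes, utf-32={len(utf32)} bytes")

# ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_compatible = "Hello".encode("ascii") == "Hello".encode("utf-8")

# Cyrillic does not fit into 7-bit ASCII at all.
try:
    "Я".encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

print(ascii_compatible, fits_in_ascii)
```

UTF-8 spends one byte on a Latin letter, two on a Cyrillic one and three on the euro sign, while UTF-32 always spends four.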
Of course, I have not listed every encoding, only the most popular ones. If you want to see more of them for general education, a longer list can be found in the browser itself. Just open the menu “View - Encoding - Select list” and look through the available options (see image).
I think a reasonable question has arisen: “Why on earth are there so many encodings?” Their abundance and the reasons for it can be compared to the cross-browser/cross-platform problem, where the same website is displayed differently in different browsers and on different devices. By the way, as you may have noticed, the site “Notes from Sys.Admin” has no trouble with this :).
All these encodings are working solutions that developers created “to suit themselves” and their own tasks. When their number exceeded all reasonable limits, and search engines began to receive queries like “How to get rid of krakozyabry in the browser?”, developers started racking their brains over how to bring this mess to a single standard that would suit everyone. Unicode, on the whole, did exactly that. Now such problems are local in nature when they arise, although many users still do not know how to fix them. Often, problems with encoding and page display appear because the webmaster specified the wrong encoding on the server side, and you have to switch the encoding in the browser.
That is all the basic theory you need to stay afloat in encoding matters; now let's move on to the practical part of the article.
Solving problems with encoding, or how to get rid of krakozyabry
Our article would be incomplete if we did not touch on everyday practical issues. Let's start with how (and with what) you can view the encoding.
Windows has a built-in Character Map. It does not need to be downloaded or installed and can be found at “Start - Programs - Accessories - System Tools - Character Map”. It shows the glyphs of every font installed on your system.
By selecting “Advanced view” (character set: Unicode) and a font, you will see the full set of characters that font contains. By clicking on any character, you will see its code, written as U+ followed by 4 hexadecimal digits, which for these characters matches the UTF-16 value (see image).
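The same lookup can be sketched in Python without the Character Map. The helper name describe is hypothetical, and the second sample is my own addition to show a character that needs two UTF-16 units.

```python
def describe(ch: str) -> str:
    """Return the code point label and the UTF-16 code units of one character."""
    utf16 = ch.encode("utf-16-be")
    # Split the UTF-16 bytes into 16-bit units, 4 hex digits each.
    units = [utf16[i:i + 2].hex().upper() for i in range(0, len(utf16), 2)]
    return f"U+{ord(ch):04X} utf-16: {' '.join(units)}"

print(describe("Я"))   # U+042F utf-16: 042F
print(describe("😀"))  # U+1F600 utf-16: D83D DE00
```

For ordinary letters the code point and the UTF-16 value are the same four hex digits; characters beyond U+FFFF are stored as a pair of units.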
Now a few words about how to get rid of krakozyabry. They can arise in two cases:
- On the user's side, when reading information on the Internet (for example, when visiting a website);
- Or, as mentioned just above, on the webmaster's side (for example, when creating or editing text files in an editor with syntax highlighting such as Notepad++, or because of an incorrect encoding specified in the site code).
Let's consider both options.
No. 1. Hieroglyphs from the user's side.
Let's say you started the OS and see the notorious krakozyabry in some applications. To fix this, go to “Start - Control Panel - Regional and Language Options - Change the language” and select “Russia” from the list.
Also check on every tab that the localization is set to “Russia/Russian”: this is the so-called system locale.
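If Python is installed, a quick way to see which encoding the system locale currently suggests is sketched below. The values in the comments are typical examples, not guaranteed output.

```python
import locale
import sys

# Encoding the OS locale suggests for text files and legacy programs
# (for example "cp1251" on a Russian Windows, "UTF-8" on most Linux systems).
preferred = locale.getpreferredencoding(False)

# Encoding Python itself uses by default for str <-> bytes (always "utf-8" in Python 3).
default = sys.getdefaultencoding()

print(preferred, default)
```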
If you opened a site and realized that hieroglyphs prevent you from reading it, change the encoding in the browser (“View - Encoding”). Change it to what? That depends on what the krakozyabry look like. Refer to the following cheat sheet (see image).
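If the cheat sheet is not at hand, the guessing can be automated. Below is a rough Python sketch that tries every (wrong, right) pair from a small list of encodings and keeps the results that look like Russian text. The list of candidates, the 0.8 threshold and the function names are my assumptions, so treat it as a heuristic rather than a reliable detector.

```python
CANDIDATES = ["utf-8", "cp1251", "koi8_r", "cp866"]

def russian_share(text):
    """Fraction of visible characters that are basic Russian letters."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    russian = [c for c in visible if "а" <= c.lower() <= "я" or c.lower() == "ё"]
    return len(russian) / len(visible)

def repair(garbled, threshold=0.8):
    """Try every (wrong, right) encoding pair and keep readable results."""
    results = []
    for wrong in CANDIDATES:
        for right in CANDIDATES:
            if wrong == right:
                continue
            try:
                # Undo the wrong decoding, then decode with another table.
                fixed = garbled.encode(wrong).decode(right)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
            if russian_share(fixed) >= threshold:
                results.append((wrong, right, fixed))
    return results

garbled = "Привет".encode("utf-8").decode("cp1251")
results = repair(garbled)
print(results)
```

For this sample the pair (cp1251, utf-8) is among the results, which means the text was UTF-8 that had been read as windows-1251. Real pages contain punctuation and digits, so the threshold may need lowering.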
No. 2. Hieroglyphs from the webmaster's side.
Very often, novice website developers do not attach much importance to the encoding of the document they are creating, and as a result they run into the problem described above. Here are some simple basic tips for webmasters on how to fix it.
To prevent this from happening, open the file in the Notepad++ editor and select “Encoding” from the menu. It will help convert the existing document. The question is, to what? Most often (if the site runs on WordPress or Joomla) the answer is “Convert to UTF-8 without BOM” (see image).
After the conversion, you will see the change in the program's status bar.
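The same conversion can be scripted when there are many files. Here is a minimal Python sketch, assuming the files are currently in windows-1251; the function name and the default source encoding are my own choices, and it makes sense to back up the files first.

```python
from pathlib import Path

def convert_to_utf8(path, source_encoding="cp1251"):
    """Rewrite a text file in place as UTF-8 without a BOM."""
    file = Path(path)
    # Read raw bytes so that line endings are left exactly as they were.
    text = file.read_bytes().decode(source_encoding)
    # The plain "utf-8" codec never writes a BOM ("utf-8-sig" would add one).
    file.write_bytes(text.encode("utf-8"))
```

A call such as convert_to_utf8("header.php") would then rewrite that one file; the file name here is only an example.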
Also, to avoid krakozyabry, you should explicitly declare the encoding in the site header. This tells the browser which encoding to use when reading the site. A novice webmaster needs to understand that confusion with encodings most often comes from a mismatch between the server settings and the site settings, i.e. the database on the server uses one encoding, while the site sends pages to the browser in a completely different one.
To do this, write the following line explicitly in the site header (most often the header.php file), between the <head> and </head> tags:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
With this line you make the browser interpret the encoding correctly, and the hieroglyphs will disappear.
You may also need to adjust the data output from the database (MySQL). This is done like this:
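// Note: the mysql_* functions were removed in PHP 7; with mysqli the usual replacement is mysqli_set_charset($link, "utf8"), where $link is your connection.
mysql_query("SET NAMES utf8");
mysql_query("SET CHARACTER SET utf8");
mysql_query("SET COLLATION_CONNECTION='utf8_general_ci'");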
Alternatively, you can also make a knight’s move and write the following lines in the .htaccess file:
# BEGIN UTF8
AddDefaultCharset utf-8
AddCharset utf-8 *
# The next two directives are understood only by Russian Apache (mod_charset);
# on a stock Apache they are likely to cause a server error, so remove them there.
CharsetSourceEnc utf-8
CharsetDefault utf-8
# END UTF8
All of the above methods (or some of them) will most likely help you and your future visitors get rid of the hated hieroglyphs and encoding problems. Unfortunately, we won't go into more detail about webmaster matters here; I think webmasters will work out the details themselves if they want to (after all, our site has a slightly different focus).
That completes the practical part of the article; all that remains is to sum up.
Afterword
Today we got acquainted with the concept of text encoding. I am sure that now, when krakozyabry appear on your monitor, you will not panic but will remember the methods given here and resolve the issue in your favor!
That's all, thank you for your attention and see you again.
Set the character set
Meta tag
You need to add a special meta tag to every page (or to the header template) that tells the browser which character set to use to display the text. This tag is standard and usually looks like this:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta charset="utf-8" /> (option for HTML 5)
You need to place it in the <head> section, preferably at the very beginning, right after the opening <head> tag (see image: meta encoding tag).
Via .htaccess (if all else fails)
Usually the first two options are enough and browsers display the text as they should. But some of them may still have problems, and in that case you can turn to the .htaccess file.
To do this, write the following line in it:
AddDefaultCharset utf-8
That's all. If you apply these 3 methods of setting the encoding to your project one after another, the probability that everything will be displayed correctly is close to 100%.
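To check that the settings agree with each other, you can compare what the server header and the meta tag declare. Below is a rough Python sketch; the regular expressions are simplified and the function name is hypothetical. As far as I know, browsers give the charset from the HTTP Content-Type header (which is what AddDefaultCharset sets) priority over the meta tag, so a mismatch between the two is a likely source of krakozyabry.

```python
import re

META_RE = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_-]+)""", re.IGNORECASE)
HEADER_RE = re.compile(r"""charset\s*=\s*["']?([A-Za-z0-9_-]+)""", re.IGNORECASE)

def declared_charsets(content_type_header, html_bytes):
    """Return (charset from HTTP header, charset from meta tag); None if absent."""
    header_match = HEADER_RE.search(content_type_header or "")
    # The meta tag is expected early in <head>, so only the first 1024 bytes are scanned.
    meta_match = META_RE.search(html_bytes[:1024])
    header_charset = header_match.group(1).lower() if header_match else None
    meta_charset = meta_match.group(1).decode("ascii").lower() if meta_match else None
    return header_charset, meta_charset

page = b'<html><head><meta charset="utf-8"></head><body></body></html>'
print(declared_charsets("text/html; charset=windows-1251", page))
```

Here the header says windows-1251 while the page says utf-8, which is exactly the kind of mismatch worth fixing.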
How to “see” what is hidden behind strange symbols on a website?
If you open a web page, see krakozyabry and want to read normal text, there are only two ways:
- inform the site owner so that everything gets configured properly
- try to guess the encoding yourself. This is done with standard browser tools. In Chrome, for example, you need to open the menu "Tools => Encoding" and choose the appropriate character set from the huge list (i.e. guess).
Fortunately, almost all modern web projects use the UTF-8 encoding, which is “universal” for different alphabets, so these strange characters are seen on the Internet less and less often.
Hi all!
I saw how many people have problems with displaying text (it comes out as hieroglyphs) and decided to write this short note. It will simply tell you why this problem occurs in 99% of cases and how to solve it. Let's go.
It doesn't matter at all whether you have your own website or a simple Word document. Hieroglyphs instead of normal Russian text can show up absolutely anywhere. But the cause is always the same: encoding. The most commonly used encoding is utf-8, but windows-1251 is also still popular. So if your server runs on utf-8 and the site is written in windows-1251, there will be hieroglyphs instead of text.
Solution: save the site files converted to utf-8 without BOM. UTF-8 comes in 2 variants, with and without BOM. The difference is that the BOM variant puts an extra byte order mark (three invisible bytes) at the very beginning of the file, which adds to its size and can break the display of the page. Therefore, we use only the variant without BOM.
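What the BOM actually looks like is easy to show in Python. This is a sketch; the helper strip_bom is a hypothetical name of mine.

```python
import codecs

text = "Привет"

without_bom = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")   # same bytes, prefixed with the BOM

def strip_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 byte order mark if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

print(with_bom[:3].hex(" "))  # ef bb bf
print(strip_bom(with_bom) == without_bom)
```

Those three bytes are what “UTF-8 without BOM” leaves out.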
Also make sure that at the beginning of the site code, in the <head> block, the charset line either says utf-8 or is absent altogether. After that the site will be displayed as it should.
Hieroglyphs are displayed in documents instead of text
If you have the same problem with documents, change the encoding. In this case, just experiment. Try utf-8 first; if that doesn't work, try windows-1251. If that doesn't work either, try another one from the Cyrillic encodings section. If nothing helps, the file may be corrupted, or there is some other problem. But in 99% of cases, changing the encoding helps.
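The same trial-and-error order can be written as a short Python sketch. The list of candidates after utf-8 and windows-1251 is my assumption about which Cyrillic encodings to try, and the function name is hypothetical.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "cp1251", "koi8_r", "cp866")):
    """Return (text, encoding) for the first encoding that decodes without errors."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit; the file may be corrupted")

text, used = decode_with_fallback("Привет".encode("cp1251"))
print(text, used)
```

UTF-8 goes first because it is strict and rejects most text saved in other encodings. The 8-bit tables accept almost any bytes, so a successful decode with them still needs a human look at the result.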
The printer prints hieroglyphs instead of text
This is also a fairly common problem, and I think you have already guessed what to do. Go to the printer settings. Somewhere there should be a section called "Encodings". Find it and see which encoding is set. If it is utf-8, try changing it to windows-1251. If it is windows-1251, try setting utf-8.
That's all. Now we know why hieroglyphs appear instead of text and how to deal with it.
Krakozyabry: what kind of interesting word is this? Russian users usually use it to describe incorrect display (encoding) of characters in programs or in the operating system itself.
Why does this happen? You won't find a single definite answer. It may be the work of our “favorite” viruses, or a malfunction of the Windows OS (for example, the electricity went out and the computer turned off), or perhaps a program conflicted with another program or with the OS and everything went haywire. In general, there can be many reasons, but the most interesting one is “It just broke.”
Read the article and find out how to fix the problem with encoding in programs and the Windows OS once it has happened. For those who still don't understand what I mean, here are a few examples:
By the way, I once found myself in this situation too, and I still keep on my desktop the file that helped me cope with it. That is why I decided to write this article.
Several “things” are responsible for displaying the encoding (font) in Windows: the language settings, the registry, and the files of the OS itself. Now we will check them one by one.
How to remove and fix krakozyabry instead of Russian letters in a program or in Windows.
1. Check the language set for programs that do not support Unicode. It may have been reset.
Follow the path: Control Panel - Regional and Language Options - Advanced tab
There, make sure that the language is Russian.
In Windows XP, there is also a list called “Code page conversion tables” at the bottom, containing a line with the number 20880. It must be set to Russian as well.
6. The last point, in which I give you the file that once helped me fix everything, which is why I kept it as a keepsake. Here is the archive:
There are two files inside: krakozbroff.cmd and krakozbroff.reg
They work on the same principle: they fix hieroglyphs, squares, question marks or exclamation marks in programs and the Windows OS (in common parlance, krakozyabry). I used the first one and it helped me.
And finally, a couple of tips:
1) If you work with the registry, do not forget to make a backup copy in case something goes wrong.
2) It is advisable to re-check point 1 after each of the other points.
That's all. Now you know how to fix or remove krakozyabry (squares, hieroglyphs, exclamation and question marks) in a program or in Windows.