Unicode in Java: Understanding Character Encoding

Rumman Ansari | Software Engineer | 2024-07-04


Encoding:

Computers use binary numbers internally. A character is stored in a computer as a sequence of 0s and 1s. Mapping a character to its binary representation is called encoding. There are different encoding techniques (a small Java sketch follows the list below):

  1. ASCII (American Standard Code for Information Interchange) for the United States.
  2. ISO 8859-1 (also known as ISO-Latin-1) for Western European languages, with code points ranging from 0 to 255.
  3. KOI-8 for Russian.
  4. GB18030 and BIG-5 for Chinese, and so on.
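
A character-to-bytes mapping can be observed directly from Java by encoding a string with a specific charset. The following is a minimal sketch using the standard java.nio.charset.StandardCharsets constants; the sample text is only illustrative:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        String text = "Aé"; // 'A' is in ASCII, 'é' is not

        // US-ASCII cannot encode 'é'; it is replaced by the default replacement byte '?' (63)
        System.out.println(Arrays.toString(text.getBytes(StandardCharsets.US_ASCII)));

        // ISO 8859-1 (Latin-1) encodes 'é' as a single byte (0xE9, printed as -23)
        System.out.println(Arrays.toString(text.getBytes(StandardCharsets.ISO_8859_1)));

        // UTF-8 encodes 'é' as two bytes
        System.out.println(Arrays.toString(text.getBytes(StandardCharsets.UTF_8)));
    }
}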

ASCII

Most computers use ASCII (American Standard Code for Information Interchange), a 7-bit encoding scheme, usually stored in an 8-bit byte, for representing all uppercase and lowercase letters, digits, punctuation marks, and control characters.

Below is an example from the C programming language.

#include"stdio.h" 
int main()
{
char a= 'A'; // a is character type variable 
printf("%d \n",a);   
return 0;
}  

Output:
The ASCII value of capital 'A' is 65.

65
Press any key to continue . . .

Disadvantages or problems of ASCII

The standard set of characters known as ASCII ranges from 0 to 127. In C/C++, char is 8 bits wide, so it can hold at most 256 distinct values. The primitive data type char was intended to provide a simple data type that could hold any character. However, it turned out that the 256 characters possible in an 8-bit encoding are not sufficient to represent all the characters in the world.

The encodings for languages with large character sets have a variable length. Some common characters are encoded as single bytes; others require two or more bytes.
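
To illustrate variable-length encoding, the widely used UTF-8 charset can be queried from Java for the number of bytes a given character needs. The specific characters below are just examples:

import java.nio.charset.StandardCharsets;

public class VariableLengthDemo {
    public static void main(String[] args) {
        // Each string holds one character, but the UTF-8 byte counts differ
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1 byte
        System.out.println("é".getBytes(StandardCharsets.UTF_8).length);  // 2 bytes
        System.out.println("中".getBytes(StandardCharsets.UTF_8).length); // 3 bytes
    }
}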

Solution

To solve these problems, a new standard was developed: the Unicode system. In Unicode as originally designed, a character occupies 2 bytes, so Java also uses 2 bytes for the char type.


Unicode

Unicode is an encoding scheme established by the Unicode Consortium to support the interchange, processing, and display of written texts in the world's diverse languages.

Java uses Unicode to represent characters. Unicode defines a fully international character set that can represent all of the characters found in all human languages. It is a unification of dozens of character sets, such as Latin, Greek, Arabic, Cyrillic, Hebrew, Katakana, Hangul, and many more. For this purpose, it requires 16 bits. Thus, in Java, char is a 16-bit type. The range of a char is 0 to 65,535. There are no negative chars.
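
Because char is a 16-bit unsigned type, its width and range can be checked directly from Java. A minimal sketch (the class name is just illustrative):

public class CharWidthDemo {
    public static void main(String[] args) {
        System.out.println(Character.SIZE);            // 16 bits per char
        System.out.println(Character.BYTES);           // 2 bytes per char
        System.out.println((int) Character.MIN_VALUE); // 0
        System.out.println((int) Character.MAX_VALUE); // 65535
    }
}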

The Unicode escapes for the Greek letters alpha, beta, and gamma are \u03b1, \u03b2, and \u03b3. If the appropriate fonts are not installed on your system, you will not be able to see such characters.

Example

public class UnicodeSystem {
    public static void main(String[] args) {
        char c = '\u0077';
        System.out.println(c);
    }
}

Output:
\u0077 is the Unicode escape for the character 'w'.

w
Press any key to continue . . . 
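
The Greek letters mentioned earlier can be printed the same way from their Unicode escapes. A small sketch (the class name is illustrative); whether the glyphs display correctly depends on the fonts installed:

public class GreekLetters {
    public static void main(String[] args) {
        char alpha = '\u03b1';
        char beta  = '\u03b2';
        char gamma = '\u03b3';
        // Concatenating with an empty string forces string conversion
        System.out.println("" + alpha + beta + gamma); // αβγ
    }
}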


Disadvantages or problems of Unicode

Unicode was originally designed as a 16-bit character encoding. The primitive data type char was intended to take advantage of this design by providing a simple data type that could hold any character. However, it turned out that the 65,536 characters possible in a 16-bit encoding are not sufficient to represent all the characters in the world.

Solution

The Unicode standard, therefore, has been extended to allow up to 1,112,064 characters. Those characters that go beyond the original 16-bit limit are called supplementary characters. Java supports supplementary characters.
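
Java exposes supplementary characters as int code points; inside a String they are stored as a pair of char values (a surrogate pair). A minimal sketch, using the code point 0x1F600 (an emoji) purely as an example:

public class SupplementaryDemo {
    public static void main(String[] args) {
        int codePoint = 0x1F600; // beyond the original 16-bit limit

        System.out.println(Character.isSupplementaryCodePoint(codePoint)); // true
        System.out.println(Character.charCount(codePoint));                // 2 chars needed

        String s = new String(Character.toChars(codePoint));
        System.out.println(s.length());                       // 2 (char units)
        System.out.println(s.codePointCount(0, s.length()));  // 1 (one actual character)
    }
}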