Understanding encoding schemes

I cannot understand some key elements of encoding:

  1. Is ASCII only a character set, or does it also have its own encoding algorithm?

  2. Do other Windows code pages, such as Latin-1, have their own encoding algorithms?

  3. Are UTF-7, UTF-8, UTF-16 and UTF-32 the only encoding algorithms?

  4. Are the UTF algorithms used only with the Unicode character set?

Given the ASCII text "Hello World", if I want to convert it to Latin-1 or Big5, which encoding algorithms are used in the process? More specifically, do Latin-1 and Big5 use their own encoding algorithms, or do I have to use a UTF algorithm?


  • I don't quite understand what you mean with 3., or why you specifically pick UTF-7 and 32…?
    – deceze
    Nov 22 at 8:54

  • Hi, I updated my question. I was wondering if the UTF algorithms are the only ones used to encode Unicode characters.
    – David
    Nov 22 at 9:06

  • #4. The U in UTF stands for Unicode. Algorithms can be applied anywhere you like but, please, let names have a declared or agreed-upon context.
    – Tom Blodget
    Dec 2 at 15:21

Tags: encoding

asked Nov 22 at 8:51 by David, edited Nov 22 at 9:15


3 Answers

1: ASCII is just an encoding, and a really simple one. It's literally just the non-negative half of a signed byte (0...127) mapped to characters and control codes.



Refer to https://www.ascii.codes/ to see the full set and inspect the characters.



There are definitely algorithms for converting ASCII strings to and from strings in other encodings, but no compression/decompression algorithm is required to write or read ASCII strings the way one is for UTF-8 or UTF-16, if that's what you're implying.
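
To make that concrete, here is a minimal sketch. Python and its built-in `ascii` codec are my choice for illustration and are not part of the answer above. It shows that writing ASCII amounts to storing each character's code value in one byte.

```python
# Assumption: Python's built-in "ascii" codec stands in for "writing an ASCII string".
text = "Hello World"
ascii_bytes = text.encode("ascii")

# Each byte is simply the character's code value; no further transformation happens.
code_values = [ord(ch) for ch in text]
assert list(ascii_bytes) == code_values

# Every value fits in 7 bits (0...127).
assert all(value < 128 for value in ascii_bytes)

print(code_values)  # [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
```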



2: Latin-1 is also not a compressed encoding (the usual term is 'variable-width'), so there's no algorithm needed to get in and out of it.

See https://kb.iu.edu/d/aepu for a nice description of Latin-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ASCII. Like ASCII, it's 1 byte in size, but it uses the whole unsigned byte, so after the last ASCII character (DEL/127), Latin-1 adds another 128 code positions (128...255).

Converting between Latin-1 and a different encoding is another matter: as with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.
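
A small sketch of the same idea for Latin-1, again assuming Python's built-in `latin-1` codec as a stand-in: each of the 256 byte values corresponds to exactly one character, and the lower half agrees with ASCII.

```python
# Assumption: Python's "latin-1" codec is used here as a stand-in for ISO-8859-1.
text = "caf\u00e9"  # "café"; U+00E9 lies outside ASCII but inside Latin-1
latin1_bytes = text.encode("latin-1")

# One byte per character, and the byte value equals the character's number.
assert list(latin1_bytes) == [ord(ch) for ch in text]  # [99, 97, 102, 233]

# Reading is the reverse: every byte value 0...255 maps to one character.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
assert [ord(ch) for ch in decoded] == list(range(256))

# The first 128 slots are plain ASCII.
assert all_bytes[:128].decode("latin-1") == all_bytes[:128].decode("ascii")
```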



3: Again, the Unicode encodings are just that: encodings. But they're all compressed (variable-width) except for UTF-32. So unless you're working with UTF-32, there is always a compression/decompression step required to write and read them.
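
The difference in width is easy to observe. The sketch below assumes Python's built-in codecs; the helper name `encoded_lengths` and the sample characters are hypothetical choices for illustration. It counts the bytes each encoding needs for a single character. Note that UTF-16 only becomes wider than two bytes for characters above U+FFFF.

```python
# Hypothetical helper: byte count of one character in three Unicode encodings.
# The "-le" variants are used so that no byte order mark is added to the output.
def encoded_lengths(ch: str) -> dict:
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

samples = {
    "LATIN CAPITAL LETTER A": "\u0041",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "GRINNING FACE": "\U0001F600",
}

for name, ch in samples.items():
    print(name, encoded_lengths(ch))

# UTF-32 is always 4 bytes per code point; the other two vary.
assert all(encoded_lengths(ch)["utf-32-le"] == 4 for ch in samples.values())
```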



Note: Even when working with UTF-32 strings there is one nonlinear oddity that has to be accounted for: combining characters. A single visible character can be made up of several code points (a base character followed by one or more combining marks). Technically that is yet another type of compression, since Unicode saves space by not giving a code point to every possible combination of base character and combining character. A few combinations are "precomposed", but the code space would run out of slots very quickly if all of them were.
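
As an illustration of that note, the sketch below uses Python's standard `unicodedata` module (my choice of tool, not the answerer's) to show one precomposed character and its equivalent base-plus-combining-mark sequence.

```python
import unicodedata

composed = "\u00e9"     # precomposed: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# Both render as the same visible character, but they are different code point sequences.
assert composed != decomposed
assert len(composed) == 1 and len(decomposed) == 2

# Normalization converts between the two forms.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# So even in fixed-width UTF-32, one visible character may occupy 4 or 8 bytes.
utf32_sizes = (len(composed.encode("utf-32-le")), len(decomposed.encode("utf-32-le")))
print(utf32_sizes)  # (4, 8)
```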



4: Yes. The compression/decompression algorithms for the compressed Unicode encodings are just for those encodings. They would not work for any other encoding.

Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also for things that are compressed using another compression algorithm (e.g. RAR).

I recently wrote the UTF-8 and UTF-16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently that if you feed a Big5-encoded string into my method written specifically for decompressing UTF-8, not only would it not work, it might very well crash.
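
A minimal sketch of that mismatch, assuming Python's built-in `big5` and `utf-8` codecs rather than the library mentioned above. The helper `try_decode` is a hypothetical name. A validating decoder rejects the input instead of crashing, but the data is just as unusable.

```python
# Hypothetical helper: return the decoded text, or None if the bytes are
# not valid in the given encoding.
def try_decode(data: bytes, encoding: str):
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

text = "\u4e2d\u6587"  # the two-character word for "Chinese (language)"
big5_bytes = text.encode("big5")

# Decoding with the matching codec recovers the text.
assert try_decode(big5_bytes, "big5") == text

# Feeding the same bytes to a UTF-8 decoder fails: they are not valid UTF-8.
assert try_decode(big5_bytes, "utf-8") is None
```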



Re: your "Hello World" question... Refer to my answer to your second question about Latin-1. No conversion is required to go from ASCII to Latin-1 because the first 128 characters (0...127) of Latin-1 are ASCII, so the bytes are identical. If you're converting from Latin-1 to ASCII, the same is true for the lower half of Latin-1, but if any of the characters beyond 127 are in the string, the result is either what's called a "lossy" (partial) conversion or an outright failure, depending on your tolerance for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.
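
Both cases can be sketched in a few lines, assuming Python's built-in codecs and its `errors="replace"` handler as one possible way of tolerating loss. The Big5 check is my addition; it relies on Python's `big5` codec encoding these ASCII letters as the same single bytes.

```python
hello = "Hello World"

# ASCII and Latin-1 give identical bytes for this text.
assert hello.encode("ascii") == hello.encode("latin-1")
# Assumption worth checking in your own tooling: the same holds for Python's Big5 codec.
assert hello.encode("ascii") == hello.encode("big5")

# Going the other way fails as soon as a character above 127 appears.
latin1_bytes = "na\u00efve".encode("latin-1")  # contains U+00EF
text = latin1_bytes.decode("latin-1")

try:
    text.encode("ascii")  # strict mode: outright failure
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Lossy mode: the unrepresentable character is replaced with "?".
lossy = text.encode("ascii", errors="replace")
print(strict_failed, lossy)  # True b'na?ve'
```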



I know practically nothing about Big5, but regardless, don't use UTF-x algorithms for other encodings. Each one of those is written very specifically for one particular encoding (or, in the case of conversion, one pair of encodings).



If you're curious about the UTF-8/UTF-16 compression/decompression algorithms, the Unicode website is where you should start (watch out though: they don't use the compression/decompression metaphor in their documentation):



http://unicode.org



You probably won't need anything else.



... except maybe a decent codepoint lookup tool: https://www.unicode.codes/



You can roll your own code based on the Unicode documentation, or use the official Unicode library (ICU):



http://site.icu-project.org/home



Hope this helps.






In general, most encoding schemes like ASCII or Latin-1 are simply big tables mapping characters to specific byte sequences. There may or may not be some specific logic behind how the creators came up with those particular character⟷byte associations, but there's generally not much more to it than that.



One of the innovations of Unicode specifically is the indirection of assigning each character a unique number (its code point) first and foremost, and worrying about how to encode that number into bytes secondarily. There are a number of encoding schemes for doing this, from the UCS and GB 18030 encodings to the most commonly used UTF-8/UTF-16 encodings. Some, like UCS-2, are largely defunct by now. Each one has its pros and cons in terms of space tradeoffs, ease of processing and transportability (e.g. UTF-7 for safe transport over 7-bit systems like email). Unless otherwise noted, they can all encode the full set of current Unicode characters.
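
A short sketch of that indirection, assuming Python's built-in codecs (the list of encodings is just a sample): the character has a single code point, and each encoding turns that one number into a different byte sequence.

```python
ch = "\u20ac"         # EURO SIGN
code_point = ord(ch)  # the character's unique number: 0x20AC

# The same code point, encoded five different ways.
encodings = ["utf-8", "utf-16-be", "utf-32-be", "utf-7", "gb18030"]
encoded = {name: ch.encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"U+{code_point:04X} in {name}: {data.hex(' ')}")
    # Every one of them decodes back to the same character.
    assert data.decode(name) == ch
```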



To convert from one encoding to another, you pretty much need to map bytes from one table to another. Meaning, if you look at the EBCDIC table and the Windows-1250 table, the bytes 0xC1 and 0x41 respectively both represent the same character "A", so when converting between the two encodings, you'd map those bytes as equivalent. Yes, that means there needs to be one such mapping for each possible pair of encodings.

Since that is obviously rather laborious, modern converters virtually always go through Unicode as a middleman. This way each encoding only needs to be mapped to the Unicode table, and the conversion can be done as encoding A → Unicode code point → encoding B. In the end you just want to identify which characters look the same/mean the same, and change the byte representation accordingly.
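
The "Unicode as a middleman" route can be sketched as follows. The function name `transcode` is hypothetical, and Python's `cp037` codec is assumed as one representative EBCDIC code page (there are several EBCDIC variants).

```python
# Hypothetical helper: encoding A -> Unicode code points -> encoding B.
def transcode(data: bytes, source: str, target: str) -> bytes:
    text = data.decode(source)   # bytes in encoding A -> Unicode string
    return text.encode(target)   # Unicode string -> bytes in encoding B

# "A" is byte 0xC1 in EBCDIC (code page 037) and byte 0x41 in Windows-1250.
assert transcode(b"\xc1", "cp037", "cp1250") == b"\x41"
assert transcode(b"\x41", "cp1250", "cp037") == b"\xc1"

# A whole string survives the round trip through both encodings.
original = "Hello World".encode("cp1250")
ebcdic = transcode(original, "cp1250", "cp037")
assert transcode(ebcdic, "cp037", "cp1250") == original
```

With this design, supporting a new encoding only requires its mapping to and from Unicode, not a table for every other encoding.
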






A character encoding is a mapping from a sequence of characters to a sequence of bytes (in the past there were also encodings to a sequence of bits, but those are falling out of fashion). Usually this mapping is one-to-one but not necessarily onto. This means there may be byte sequences that don't correspond to any character sequence in this encoding.

The domain of the mapping defines which characters can be encoded.
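
Both properties can be probed with Python's built-in codecs. The helper names `is_valid` and `can_encode` below are hypothetical and exist only for this sketch.

```python
# Hypothetical helper: is this byte sequence in the range of the mapping?
def is_valid(data: bytes, encoding: str) -> bool:
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical helper: is this text in the domain of the mapping?
def can_encode(text: str, encoding: str) -> bool:
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# "Not onto": some byte sequences correspond to no character sequence.
assert is_valid(b"\xc3\xa9", "utf-8")   # valid two-byte sequence for U+00E9
assert not is_valid(b"\xff", "utf-8")   # 0xFF never occurs in UTF-8
assert not is_valid(b"\xe9", "ascii")   # ASCII only uses 0...127

# The domain decides which characters can be encoded at all.
assert can_encode("\u00e9", "latin-1")      # e with acute is in Latin-1
assert not can_encode("\u20ac", "latin-1")  # the euro sign is not
assert can_encode("\u20ac", "utf-8")
```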



      Now to your questions:




1. ASCII is both: it defines 128 characters (some of them control codes) and how they are mapped to the byte values 0 to 127.

2. Each encoding may define its own set of characters and how they are mapped to bytes.

3. No, there are others as well: ASCII, ISO-8859-1, ...

4. Unicode uses a two-step mapping: first the characters are mapped to (relatively) small integers called "code points", then these integers are mapped to a byte sequence. The first step is the same for all UTF encodings; the second step differs. Unicode has the ambition to contain all characters, which means most characters are in the "Unicode set".
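
The second step in point 4 can be written out by hand. The following is a minimal sketch of the standard UTF-8 bit layout, not a production encoder; the function names are hypothetical and the result is checked against Python's built-in codec.

```python
# Step 2 of the two-step mapping: code point (integer) -> UTF-8 bytes.
def utf8_encode_code_point(cp: int) -> bytes:
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def utf8_encode(text: str) -> bytes:
    # Step 1 (character -> code point) is ord(); step 2 is the function above.
    return b"".join(utf8_encode_code_point(ord(ch)) for ch in text)

sample = "A\u00e9\u20ac\U0001F600"
assert utf8_encode(sample) == sample.encode("utf-8")
```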






  • In point 4: the term is code units. Character sets are a set of code points: a mapping between a conceptual character and an integer. Character encodings have code units: they are a map between a code point and a sequence of one or more code units. (And then there is serialization: a map between a code unit integer and a sequence of bytes with a given endianness.)
    – Tom Blodget
    Nov 22 at 15:30
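
The three layers named in this comment (code point, code units, serialized bytes) can be sketched for UTF-16 as follows. The variable names are my own, and Python's `struct` module is used only to make the byte order explicit.

```python
import struct

ch = "\U0001F600"      # GRINNING FACE
code_point = ord(ch)   # character set level: 0x1F600

# Encoding form: code point -> 16-bit code units (a surrogate pair above U+FFFF).
offset = code_point - 0x10000
high = 0xD800 | (offset >> 10)
low = 0xDC00 | (offset & 0x3FF)
code_units = [high, low]  # [0xD83D, 0xDE00]

# Serialization: code units -> bytes, with an explicit endianness.
little_endian = struct.pack("<2H", *code_units)
big_endian = struct.pack(">2H", *code_units)

# Python's codecs produce the same byte sequences.
assert little_endian == ch.encode("utf-16-le")
assert big_endian == ch.encode("utf-16-be")
print([hex(u) for u in code_units], little_endian.hex(" "), big_endian.hex(" "))
```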











      Your Answer






      StackExchange.ifUsing("editor", function () {
      StackExchange.using("externalEditor", function () {
      StackExchange.using("snippets", function () {
      StackExchange.snippets.init();
      });
      });
      }, "code-snippets");

      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "1"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53427032%2funderstanding-encoding-schemes%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      3 Answers
      3






      active

      oldest

      votes








      3 Answers
      3






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes








      up vote
      1
      down vote













      1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.



      Refer to https://www.ascii.codes/ to see the full set and inspect the characters.



      There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.



      2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.



      See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.



      As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.



      3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.



      Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.



      4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.



      Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).



      I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.



      Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.



      I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).



      If you're curious about utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out though. they don't use the compression/decompression metaphor in their documentation):



      http://unicode.org



      You probably won't need anything else.



      ... except maybe a decent codepoint lookup tool: https://www.unicode.codes/



      You can roll your own code based on the unicode documentation, or use the official unicode library:



      http://site.icu-project.org/home



      Hope this helps.






      share|improve this answer



























        up vote
        1
        down vote













        1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.



        Refer to https://www.ascii.codes/ to see the full set and inspect the characters.



        There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.



        2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.



        See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.



        As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.



        3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.



        Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.



        4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.



        Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).



        I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.



        Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.



        I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).



        If you're curious about utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out though. they don't use the compression/decompression metaphor in their documentation):



        http://unicode.org



        You probably won't need anything else.



        ... except maybe a decent codepoint lookup tool: https://www.unicode.codes/



        You can roll your own code based on the unicode documentation, or use the official unicode library:



        http://site.icu-project.org/home



        Hope this helps.






        share|improve this answer

























          up vote
          1
          down vote










          up vote
          1
          down vote









          1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.



          Refer to https://www.ascii.codes/ to see the full set and inspect the characters.



          There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.



          2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.



          See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.



          As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.



          3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.



          Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.



          4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.



          Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).



          I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.



          Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.



          I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).



          If you're curious about utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out though. they don't use the compression/decompression metaphor in their documentation):



          http://unicode.org



          You probably won't need anything else.



          ... except maybe a decent codepoint lookup tool: https://www.unicode.codes/



          You can roll your own code based on the unicode documentation, or use the official unicode library:



          http://site.icu-project.org/home



          Hope this helps.






          share|improve this answer














          1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.



          Refer to https://www.ascii.codes/ to see the full set and inspect the characters.



          There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.



          2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.



          See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.



          As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.



          3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.



          Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.



          4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.



          Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).



          I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.



          Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.



          I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).



          If you're curious about utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out though. they don't use the compression/decompression metaphor in their documentation):



          http://unicode.org



          You probably won't need anything else.



          ... except maybe a decent codepoint lookup tool: https://www.unicode.codes/



          You can roll your own code based on the Unicode documentation, or use the official Unicode library (ICU):



          http://site.icu-project.org/home



          Hope this helps.















          edited Dec 3 at 22:11

























          answered Dec 2 at 8:15









          Craig






































              In general, most encoding schemes like ASCII or Latin-1 are simply big tables mapping characters to specific byte sequences. There may or may not be some specific algorithm how the creators came up with those specific character⟷byte associations, but there's generally not much more to it than that.



              One of the innovations of Unicode specifically is the indirection of assigning each character a unique number first and foremost, and worrying about how to encode that number into bytes secondarily. There are a number of encoding schemes for doing this, from the UCS and GB 18030 encodings to the most commonly used UTF-8/UTF-16 encodings. Some, like UCS-2, are largely defunct by now. Each one has its pros and cons in terms of space tradeoffs, ease of processing and transportability (e.g. UTF-7 for safe transport over 7-bit systems like email). Unless otherwise noted, they can all encode the full set of current Unicode characters.
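The space tradeoffs can be seen with a quick Python sketch. The sample characters and the helper `encoded_sizes` are chosen for this example only; the codec names are Python's built-in ones.

```python
def encoded_sizes(ch: str) -> dict:
    """Byte length of one character in three Unicode encodings (no BOM)."""
    return {enc: len(ch.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for ch in "Aé中😀":
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# UTF-7 keeps every byte in the 7-bit range, even for non-ASCII text.
utf7_bytes = "é".encode("utf-7")
print(all(b < 0x80 for b in utf7_bytes))   # True
```

UTF-8 is compact for ASCII-heavy text, UTF-16 is compact for many CJK characters, and UTF-32 is fixed-width at the cost of size.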



              To convert from one encoding to another, you pretty much need to map bytes from one table to another. For example, if you look at the EBCDIC table and the Windows-1250 table, the bytes 0xC1 and 0x41 respectively both represent the same character "A", so when converting between the two encodings, you'd map those bytes as equivalent. Yes, that means there needs to be one such mapping for each possible pair of encodings.



              Since that is obviously rather laborious, modern converters virtually always go through Unicode as a middleman. This way each encoding only needs to be mapped to the Unicode table, and the conversion can be done with encoding A → Unicode code point → encoding B. In the end you just want to identify which characters look the same/mean the same, and change the byte representation accordingly.
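A minimal Python sketch of the "Unicode as a middleman" route, assuming Python's built-in `cp037` codec as the EBCDIC variant and `cp1250` for Windows-1250. The function name `transcode` is invented for this example.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between encodings via Unicode code points."""
    text = data.decode(source)   # encoding A -> Unicode code points
    return text.encode(target)   # Unicode code points -> encoding B

# The EBCDIC byte 0xC1 and the Windows-1250 byte 0x41 both mean "A".
print(transcode(b"\xc1", "cp037", "cp1250"))   # b'A'

# A whole string round-trips the same way.
ebcdic_hello = "Hello World".encode("cp037")
print(transcode(ebcdic_hello, "cp037", "cp1250"))   # b'Hello World'
```

Only one table per encoding is needed this way, instead of one table per pair of encodings.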
















                  answered Nov 22 at 9:28









                  deceze




































                      A character encoding is a mapping from a sequence of characters to a sequence of bytes (in the past there were also encodings to a sequence of bits - they are falling out of fashion). Usually this mapping is one-to-one but not necessarily onto. This means there may be byte sequences that don't correspond to a character sequence in this encoding.



                      The domain of the mapping defines which characters can be encoded.
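Both halves of this definition can be probed with Python's built-in codecs. The helpers `can_encode` and `can_decode` below are names made up for this sketch.

```python
def can_encode(text: str, encoding: str) -> bool:
    """True if every character of text is in the encoding's domain."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def can_decode(data: bytes, encoding: str) -> bool:
    """True if data is a valid byte sequence in the encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Not onto: some byte sequences correspond to no character sequence.
print(can_decode(b"\xff", "utf-8"))    # False
print(can_decode(b"\x80", "ascii"))    # False
print(can_decode(b"\xff", "latin-1"))  # True

# The domain differs per encoding: the euro sign is not in Latin-1.
print(can_encode("€", "latin-1"))      # False
print(can_encode("€", "cp1252"))       # True
print(can_encode("€", "utf-8"))        # True
```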



                      Now to your questions:




                      1. ASCII is both: it defines 128 characters (some of them are control codes) and how they are mapped to the byte values 0 to 127.

                      2. Each encoding may define its own set of characters and how they are mapped to bytes.

                      3. No, there are others as well, such as ASCII, ISO-8859-1, ...

                      4. Unicode uses a two-step mapping: first the characters are mapped to (relatively) small integers called "code points", then these integers are mapped to a byte sequence. The first step is the same for all UTF encodings; the second step differs. Unicode has the ambition to contain all characters. This means that most characters are in the "UNICODE set".
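The two-step mapping from point 4 can be sketched in Python: `ord` performs the first step (character to code point), and the hand-written function below performs the second step for UTF-8. The function name `utf8_encode_code_point` is made up for this illustration; real code would normally just call `str.encode`.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Step 2 for UTF-8: map one code point to 1-4 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:        # 7 bits, one byte
        return bytes([cp])
    if cp < 0x800:       # 11 bits, two bytes
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 16 bits, three bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 21 bits, four bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for ch in "Aé中😀":
    code_point = ord(ch)                       # step 1
    data = utf8_encode_code_point(code_point)  # step 2
    print(f"{ch!r} -> U+{code_point:04X} -> {data.hex(' ')}")
```

Swapping in a different second step (UTF-16, UTF-32) changes the bytes but not the code points.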





























                      • In point 4: the term is code units. Character sets are a set of codepoints: a mapping between a conceptual character and an integer. Character encodings have code units. They are a map between a codepoint and one or more code unit sequences. (And then there is serialization: a map between a code unit integer and a sequence of bytes with a given endianness.)
                        – Tom Blodget
                        Nov 22 at 15:30
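The layers named in this comment (code point, code unit, serialized bytes) can be separated in a short Python sketch for UTF-16. The function names are invented for this example; the standard `struct` module handles the endianness.

```python
import struct
from typing import List

def utf16_code_units(text: str) -> List[int]:
    """Map each code point to one or two 16-bit code units."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            units.append(cp)
        else:
            cp -= 0x10000   # split into a surrogate pair
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units

def serialize(units: List[int], big_endian: bool) -> bytes:
    """Serialize 16-bit code units to bytes with a given endianness."""
    fmt = (">" if big_endian else "<") + "H" * len(units)
    return struct.pack(fmt, *units)

units = utf16_code_units("A😀")
print([f"{u:04X}" for u in units])                # ['0041', 'D83D', 'DE00']
print(serialize(units, big_endian=True).hex(" "))
print(serialize(units, big_endian=False).hex(" "))
```

The code units are the same in both cases; only the byte order of each unit changes.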























                      edited Nov 22 at 9:39

























                      answered Nov 22 at 9:31









                      Henry






























