From c19a631c994e3745e821a87cc7eca3f02c33bda7 Mon Sep 17 00:00:00 2001
From: Burdette Lamar
Date: Thu, 24 Feb 2022 14:10:49 -0600
Subject: [PATCH] [DOC] Enhancements for encoding.rdoc (#5578)

Adds sections:

    String Encoding
    Symbol and Regexp Encodings
    Filesystem Encoding
    Locale Encoding
    IO Encodings
        External Encoding
        Internal Encoding
    Script Encoding
    Transcoding
        Transcoding a String
---
 doc/encoding.rdoc | 170 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 169 insertions(+), 1 deletion(-)

diff --git a/doc/encoding.rdoc b/doc/encoding.rdoc
index 6f663b14cd..490066b5df 100644
--- a/doc/encoding.rdoc
+++ b/doc/encoding.rdoc
@@ -132,7 +132,175 @@ returns the \Encoding of the concatenated string, or +nil+ if incompatible:
   s1 = "\xa1\xa1".force_encoding('euc-jp')     # => "\x{A1A1}"
   Encoding.compatible?(s0, s1)                 # => nil

-==== \Encoding Options
+=== \String \Encoding
+
+A Ruby String object has an encoding that is an instance of class \Encoding.
+The encoding may be retrieved by method String#encoding.
+
+The default encoding for a string literal is the script encoding
+(see Encoding@Script+encoding):
+
+  's'.encoding # => #<Encoding:UTF-8>
+
+The default encoding for a string created with method String.new is:
+
+- For a \String object argument, the encoding of that string.
+- For a string literal, the script encoding (see Encoding@Script+encoding).
+
+In either case, any encoding may be specified:
+
+  s = String.new(encoding: 'UTF-8')             # => ""
+  s.encoding                                    # => #<Encoding:UTF-8>
+  s = String.new('foo', encoding: 'ASCII-8BIT') # => "foo"
+  s.encoding                                    # => #<Encoding:ASCII-8BIT>
+
+The encoding for a string may be changed:
+
+  s = "R\xC3\xA9sum\xC3\xA9"     # => "Résumé"
+  s.encoding                     # => #<Encoding:UTF-8>
+  s.force_encoding('ISO-8859-1') # => "R\xC3\xA9sum\xC3\xA9"
+  s.encoding                     # => #<Encoding:ISO-8859-1>
+
+Changing the assigned encoding does not alter the content of the string;
+it changes only the way the content is to be interpreted:
+
+  s                         # => "R\xC3\xA9sum\xC3\xA9"
+  s.force_encoding('UTF-8') # => "Résumé"
+
+The actual content of a string may also be altered;
+see {Transcoding a String}[#label-Transcoding+a+String].
+
+Here are a couple of useful query methods:
+
+  s = "abc".force_encoding("UTF-8")         # => "abc"
+  s.ascii_only?                             # => true
+  s = "abc\u{6666}".force_encoding("UTF-8") # => "abc晦"
+  s.ascii_only?                             # => false
+
+  s = "\xc2\xa1".force_encoding("UTF-8") # => "¡"
+  s.valid_encoding?                      # => true
+  s = "\xc2".force_encoding("UTF-8")     # => "\xC2"
+  s.valid_encoding?                      # => false
+
+=== \Symbol and \Regexp Encodings
+
+The string stored in a Symbol or Regexp object also has an encoding;
+the encoding may be retrieved by method Symbol#encoding or Regexp#encoding.
+
+The default encoding for these, however, is:
+
+- US-ASCII, if all characters are US-ASCII.
+- The script encoding, otherwise (see Encoding@Script+encoding).
+
+=== Filesystem \Encoding
+
+The filesystem encoding is the default \Encoding for a string from the filesystem:
+
+  Encoding.find("filesystem") # => #
+
+=== Locale \Encoding
+
+The locale encoding is the default encoding for a string from the environment,
+other than from the filesystem:
+
+  Encoding.find('locale') # => #
+
+=== \IO Encodings
+
+An IO object (an input/output stream), and by inheritance a File object,
+has at least one, and sometimes two, encodings:
+
+- Its _external_ _encoding_ identifies the encoding of the stream.
+- Its _internal_ _encoding_, if not +nil+, specifies the encoding
+  to be used for the string constructed from the stream.
+
+==== External \Encoding
+
+Bytes read from the stream are decoded into characters via the external encoding;
+by default (that is, if the internal encoding is +nil+),
+those characters become a string whose encoding is set to the external encoding.
+
+The default external encoding is:
+
+- UTF-8 for a text stream.
+- ASCII-8BIT for a binary stream.
+
+  f = File.open('t.rus', 'rb')
+  f.external_encoding # => #<Encoding:ASCII-8BIT>
+
+The external encoding may be set by the open option +external_encoding+:
+
+  f = File.open('t.txt', external_encoding: 'ASCII-8BIT')
+  f.external_encoding # => #<Encoding:ASCII-8BIT>
+
+The external encoding may also be set by method #set_encoding:
+
+  f = File.open('t.txt')
+  f.set_encoding('ASCII-8BIT')
+  f.external_encoding # => #<Encoding:ASCII-8BIT>
+
+==== Internal \Encoding
+
+If not +nil+, the internal encoding specifies that the characters read
+from the stream are to be converted to characters in the internal encoding;
+those characters become a string whose encoding is set to the internal encoding.
+
+The default internal encoding is +nil+ (no conversion).
+The internal encoding may be set by the open option +internal_encoding+:
+
+  f = File.open('t.txt', internal_encoding: 'ASCII-8BIT')
+  f.internal_encoding # => #<Encoding:ASCII-8BIT>
+
+The internal encoding may also be set by method #set_encoding:
+
+  f = File.open('t.txt')
+  f.set_encoding('UTF-8', 'ASCII-8BIT')
+  f.internal_encoding # => #<Encoding:ASCII-8BIT>
+
+=== Script \Encoding
+
+A Ruby script has a script encoding, which may be retrieved by:
+
+  __ENCODING__ # => #<Encoding:UTF-8>
+
+The default script encoding is UTF-8;
+a Ruby source file may set its script encoding with a magic comment
+on the first line of the file (or second line, if there is a shebang on the first).
+The comment must contain the word +coding+ or +encoding+,
+followed by a colon, space and the Encoding name or alias:
+
+  # encoding: ISO-8859-1
+  __ENCODING__ #=> #<Encoding:ISO-8859-1>
+
+=== Transcoding
+
+_Transcoding_ is the process of revising the content of a string or stream
+by changing its encoding.
+
+==== Transcoding a \String
+
+Each of these methods transcodes a string:
+
+String#encode :: Transcodes a string into a new string
+                 according to a given destination encoding,
+                 a given or default source encoding, and encoding options.
+
+String#encode! :: Like String#encode,
+                  but transcodes the string in place.
+
+String#scrub :: Transcodes a string into a new string
+                by replacing invalid byte sequences
+                with a given or default replacement string.
+
+String#scrub! :: Like String#scrub, but transcodes the string in place.
+
+String#unicode_normalize :: Transcodes a string into a new string
+                            according to Unicode normalization.
+
+String#unicode_normalize! :: Like String#unicode_normalize,
+                             but transcodes the string in place.
+
+=== \Encoding Options

 A number of methods in the Ruby core accept keyword arguments as encoding options.