mirror of https://github.com/ruby/ruby.git synced 2022-11-09 12:17:21 -05:00

[DOC] Enhancements for encoding.rdoc (#5578)

Adds sections:

    String Encoding
    Symbol and Regexp Encodings
    Filesystem Encoding
    Locale Encoding
    IO Encodings
        External Encoding
        Internal Encoding
    Script Encoding
    Transcoding
        Transcoding a String
Burdette Lamar 2022-02-24 14:10:49 -06:00 committed by GitHub
parent fc7e42a473
commit c19a631c99
Notes: git 2022-02-25 05:11:10 +09:00
Merged-By: BurdetteLamar <BurdetteLamar@Yahoo.com>


@ -132,7 +132,175 @@ returns the \Encoding of the concatenated string, or +nil+ if incompatible:
s1 = "\xa1\xa1".force_encoding('euc-jp') # => "\x{A1A1}"
Encoding.compatible?(s0, s1) # => nil
==== \Encoding Options
=== \String \Encoding
A Ruby String object has an encoding that is an instance of class \Encoding.
The encoding may be retrieved by method String#encoding.
The default encoding for a string literal is the script encoding
(see Encoding@Script+encoding):
's'.encoding # => #<Encoding:UTF-8>
The default encoding for a string created with method String.new is:
- For a \String object argument, the encoding of that string.
- For a string literal, the script encoding (see Encoding@Script+encoding).
In either case, any encoding may be specified:
s = String.new(encoding: 'UTF-8') # => ""
s.encoding # => #<Encoding:UTF-8>
s = String.new('foo', encoding: 'ASCII-8BIT') # => "foo"
s.encoding # => #<Encoding:ASCII-8BIT>
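The defaults described above can be checked with a minimal sketch like the following. The variable names are illustrative only; the no-argument case (which yields ASCII-8BIT, also reachable as Encoding::BINARY) is included for comparison.

```ruby
# Sketch: which encoding String.new assigns in a few common cases.
empty         = String.new                       # no argument at all
latin1_source = 'foo'.encode('ISO-8859-1')       # a String in a non-default encoding
copy          = String.new(latin1_source)        # String argument: its encoding is kept
forced        = String.new(latin1_source, encoding: 'UTF-8') # explicit encoding wins

puts empty.encoding == Encoding::BINARY          # true (ASCII-8BIT)
puts copy.encoding == Encoding::ISO_8859_1       # true
puts forced.encoding == Encoding::UTF_8          # true
```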
The encoding for a string may be changed:
s = "R\xC3\xA9sum\xC3\xA9" # => "Résumé"
s.encoding # => #<Encoding:UTF-8>
s.force_encoding('ISO-8859-1') # => "R\xC3\xA9sum\xC3\xA9"
s.encoding # => #<Encoding:ISO-8859-1>
Changing the assigned encoding does not alter the content of the string;
it changes only the way the content is to be interpreted:
s # => "R\xC3\xA9sum\xC3\xA9"
s.force_encoding('UTF-8') # => "Résumé"
The actual content of a string may also be altered;
see {Transcoding a String}[#label-Transcoding+a+String].
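As a small illustration of the point that String#force_encoding relabels the bytes without rewriting them, the sketch below compares the byte sequence and the character count before and after the change. The variable names are illustrative.

```ruby
# Sketch: force_encoding changes interpretation, not content.
s = "R\xC3\xA9sum\xC3\xA9".dup    # dup so that we mutate our own copy

utf8_bytes  = s.bytes             # the raw bytes, read while tagged UTF-8
utf8_length = s.length            # 6 characters: each "\xC3\xA9" pair is one "é"

s.force_encoding('ISO-8859-1')
latin1_bytes  = s.bytes           # same bytes as before
latin1_length = s.length          # 8 characters: every byte is its own character

puts utf8_bytes == latin1_bytes   # true
puts utf8_length                  # 6
puts latin1_length                # 8
```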
Here are a couple of useful query methods:
s = "abc".force_encoding("UTF-8") # => "abc"
s.ascii_only? # => true
s = "abc\u{6666}".force_encoding("UTF-8") # => "abc晦"
s.ascii_only? # => false
s = "\xc2\xa1".force_encoding("UTF-8") # => "¡"
s.valid_encoding? # => true
s = "\xc2".force_encoding("UTF-8") # => "\xC2"
s.valid_encoding? # => false
=== \Symbol and \Regexp Encodings
The string stored in a Symbol or Regexp object also has an encoding;
the encoding may be retrieved by method Symbol#encoding or Regexp#encoding.
The default encoding for these, however, is:
- US-ASCII, if all characters are US-ASCII.
- The script encoding, otherwise (see Encoding@Script+encoding).
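A short sketch of the two cases, assuming the file containing it is saved as UTF-8 with no magic comment (so the script encoding is the default). The variable names are illustrative.

```ruby
# Sketch: default encodings of Symbol and Regexp literals.
ascii_symbol = :foo
wide_symbol  = :"café"
ascii_regexp = /foo/
wide_regexp  = /café/

puts ascii_symbol.encoding == Encoding::US_ASCII  # true: only US-ASCII characters
puts ascii_regexp.encoding == Encoding::US_ASCII  # true
puts wide_symbol.encoding == __ENCODING__         # true: falls back to the script encoding
puts wide_regexp.encoding == __ENCODING__         # true
```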
=== Filesystem \Encoding
The filesystem encoding is the default \Encoding for a string from the filesystem:
Encoding.find("filesystem") # => #<Encoding:UTF-8>
=== Locale \Encoding
The locale encoding is the default encoding for a string from the environment,
other than from the filesystem:
Encoding.find('locale') # => #<Encoding:IBM437>
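Both the filesystem and locale encodings depend on the platform and the environment, so the values printed by the sketch below will differ from machine to machine; no particular output is assumed. Encoding.locale_charmap is shown only for comparison with the locale encoding.

```ruby
# Sketch: inspect the environment-dependent encodings on this machine.
locale_encoding     = Encoding.find('locale')
filesystem_encoding = Encoding.find('filesystem')

puts "locale encoding:     #{locale_encoding.name}"
puts "filesystem encoding: #{filesystem_encoding.name}"
puts "locale charmap:      #{Encoding.locale_charmap}"
```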
=== \IO Encodings
An IO object (an input/output stream), and by inheritance a File object,
has at least one, and sometimes two, encodings:
- Its _external_ _encoding_ identifies the encoding of the stream.
- Its _internal_ _encoding_, if not +nil+, specifies the encoding
to be used for the string constructed from the stream.
==== External \Encoding
Bytes read from the stream are decoded into characters via the external encoding;
by default (that is, if the internal encoding is +nil+),
those characters become a string whose encoding is set to the external encoding.
The default external encoding is:
- UTF-8 for a text stream.
- ASCII-8BIT for a binary stream.
f = File.open('t.rus', 'rb')
f.external_encoding # => #<Encoding:ASCII-8BIT>
The external encoding may be set by the open option +external_encoding+:
f = File.open('t.txt', external_encoding: 'ASCII-8BIT')
f.external_encoding # => #<Encoding:ASCII-8BIT>
The external encoding may also be set by method #set_encoding:
f = File.open('t.txt')
f.set_encoding('ASCII-8BIT')
f.external_encoding # => #<Encoding:ASCII-8BIT>
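To see the effect on the strings actually read, here is a self-contained sketch that writes a small file in a temporary directory and reads it back with an explicit external encoding. The file name and variable names are illustrative, and the sketch assumes Encoding.default_internal is +nil+ (the usual default).

```ruby
require 'tmpdir'

# Sketch: the external encoding becomes the encoding of the strings read.
stream_encoding = nil
text_encoding   = nil
text_bytes      = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, 't.txt')
  File.binwrite(path, "R\xE9sum\xE9".b)      # "Résumé" as ISO-8859-1 bytes

  File.open(path, external_encoding: 'ISO-8859-1') do |f|
    stream_encoding = f.external_encoding
    text            = f.read                 # no internal encoding: no conversion
    text_encoding   = text.encoding
    text_bytes      = text.bytes
  end
end

puts stream_encoding == Encoding::ISO_8859_1 # true
puts text_encoding == Encoding::ISO_8859_1   # true
puts text_bytes.size                         # 6: the bytes are passed through unchanged
```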
==== Internal \Encoding
If not +nil+, the internal encoding specifies that the characters read
from the stream are to be converted to characters in the internal encoding;
those characters become a string whose encoding is set to the internal encoding.
The default internal encoding is +nil+ (no conversion).
The internal encoding may be set by the open option +internal_encoding+:
f = File.open('t.txt', internal_encoding: 'ASCII-8BIT')
f.internal_encoding # => #<Encoding:ASCII-8BIT>
The internal encoding may also be set by method #set_encoding:
f = File.open('t.txt')
f.set_encoding('UTF-8', 'ASCII-8BIT')
f.internal_encoding # => #<Encoding:ASCII-8BIT>
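The conversion described above can be observed with a sketch like this one, which stores ISO-8859-1 bytes in a temporary file and asks for UTF-8 strings on read. The file name and variable names are illustrative.

```ruby
require 'tmpdir'

# Sketch: with an internal encoding set, data is converted while being read.
text = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, 't.txt')
  File.binwrite(path, "R\xE9sum\xE9".b)      # 6 bytes of ISO-8859-1

  File.open(path, external_encoding: 'ISO-8859-1',
                  internal_encoding: 'UTF-8') do |f|
    text = f.read                            # transcoded ISO-8859-1 -> UTF-8
  end
end

puts text.encoding == Encoding::UTF_8        # true
puts text == "R\u00E9sum\u00E9"              # true
puts text.bytesize                           # 8: each "é" now takes two bytes
```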
=== Script \Encoding
A Ruby script has a script encoding, which may be retrieved by:
__ENCODING__ # => #<Encoding:UTF-8>
The default script encoding is UTF-8;
a Ruby source file may set its script encoding with a magic comment
on the first line of the file (or second line, if there is a shebang on the first).
The comment must contain the word +coding+ or +encoding+,
followed by a colon, space and the Encoding name or alias:
# encoding: ISO-8859-1
__ENCODING__ # => #<Encoding:ISO-8859-1>
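Because the magic comment must sit at the top of a source file, one way to try it without editing the current file is to generate a tiny script and load it. The sketch below does that; the file name and the two global variables are hypothetical and exist only to carry values back out of the loaded script.

```ruby
require 'tmpdir'

# Sketch: a loaded file with a magic comment has its own script encoding.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'latin1_script.rb')
  File.write(path, <<~SCRIPT)
    # encoding: ISO-8859-1
    $demo_script_encoding  = __ENCODING__
    $demo_literal_encoding = 's'.encoding
  SCRIPT
  load path
end

puts $demo_script_encoding == Encoding::ISO_8859_1   # true
puts $demo_literal_encoding == Encoding::ISO_8859_1  # true: literals follow the script encoding
```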
=== Transcoding
_Transcoding_ is the process of revising the content of a string or stream
by changing its encoding.
==== Transcoding a \String
Each of these methods transcodes a string:
String#encode :: Transcodes a string into a new string
according to a given destination encoding,
a given or default source encoding, and encoding options.
String#encode! :: Like String#encode,
but transcodes the string in place.
String#scrub :: Transcodes a string into a new string
by replacing invalid byte sequences
with a given or default replacement string.
String#scrub! :: Like String#scrub, but transcodes the string in place.
String#unicode_normalize :: Transcodes a string into a new string
according to Unicode normalization.
String#unicode_normalize! :: Like String#unicode_normalize,
but transcodes the string in place.
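A brief sketch of the non-bang methods listed above, using default options throughout. The sample strings and variable names are illustrative.

```ruby
# Sketch: String#encode, String#scrub, and String#unicode_normalize.
utf8       = "R\u00E9sum\u00E9"              # "Résumé" in UTF-8
latin1     = utf8.encode('ISO-8859-1')       # new string, re-encoded bytes
round_trip = latin1.encode('UTF-8')          # and back again

broken     = "abc\xC2xyz"                    # "\xC2" alone is invalid in UTF-8
scrubbed   = broken.scrub('?')               # invalid bytes replaced by "?"

decomposed = "e\u0301"                       # "e" followed by a combining acute accent
composed   = decomposed.unicode_normalize    # NFC is the default form

puts utf8.bytesize                           # 8
puts latin1.bytesize                         # 6
puts round_trip == utf8                      # true
puts scrubbed                                # abc?xyz
puts decomposed.length                       # 2
puts composed.length                         # 1
```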
=== \Encoding Options
A number of methods in the Ruby core accept keyword arguments as encoding options.