gitlab-org--gitlab-foss/lib/gitlab/encoding_helper.rb

module Gitlab
  module EncodingHelper
    extend self

    # This threshold is tuned to prevent the use of encodings that CharlockHolmes
    # detects with low confidence. If CharlockHolmes's confidence is low, we're
    # better off sticking with UTF-8.
    # Reason: git diff can return strings with invalid UTF-8 byte sequences if it
    # truncates a diff in the middle of a multibyte character. In this case
    # CharlockHolmes will try to guess the encoding and will likely suggest an
    # obscure encoding with low confidence.
    # More details are in this merge request:
    # https://gitlab.com/gitlab-org/gitlab_git/merge_requests/77#note_4754193
    # The value 50 was recommended in
    # https://gitlab.com/gitlab-org/gitlab-ce/issues/35098#note_35036746, where
    # UTF-8 data with a single bad character was detected as Shift JIS with a
    # confidence of 42.
    ENCODING_CONFIDENCE_THRESHOLD = 50

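    # Returns message as valid UTF-8 where possible. Data detected as binary is
    # returned tagged as BINARY; otherwise undecodable bytes are dropped.
    def encode!(message)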
      message = force_encode_utf8(message)
      return message if message.valid_encoding?

      # Return the message tagged as BINARY if the detector is certain it is binary.
      detect = CharlockHolmes::EncodingDetector.detect(message)
      return message.force_encoding("BINARY") if detect_binary?(message, detect)

      if detect && detect[:encoding] && detect[:confidence] > ENCODING_CONFIDENCE_THRESHOLD
        # force detected encoding if we have sufficient confidence.
        message.force_encoding(detect[:encoding])
      end

      # encode and clean the bad chars
      message.replace clean(message)
    rescue ArgumentError => e
      return unless e.message.include?('unknown encoding name')

      encoding = detect ? detect[:encoding] : "unknown"
      "--broken encoding: #{encoding}"
    end

    # True only when CharlockHolmes reports binary data with a confidence of 100.
    def detect_binary?(data, detect = nil)
      detect ||= CharlockHolmes::EncodingDetector.detect(data)
      detect && detect[:type] == :binary && detect[:confidence] == 100
    end

    def detect_libgit2_binary?(data)
      # EncodingDetector checks the first 1024 * 1024 bytes for NUL byte, libgit2 checks
      # only the first 8000 (https://github.com/libgit2/libgit2/blob/2ed855a9e8f9af211e7274021c2264e600c0f86b/src/filter.h#L15),
      # which is what we use below to keep a consistent behavior.
      detect = CharlockHolmes::EncodingDetector.new(8000).detect(data)
      detect && detect[:type] == :binary
    end

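    # Returns a UTF-8 version of message, converting from the detected encoding
    # when one is found. Returns '' if that conversion fails, and nil if message
    # cannot have its encoding forced.
    def encode_utf8(message)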
      message = force_encode_utf8(message)
      return message if message.valid_encoding?

      detect = CharlockHolmes::EncodingDetector.detect(message)
      if detect && detect[:encoding]
        begin
          CharlockHolmes::Converter.convert(message, detect[:encoding], 'UTF-8')
        rescue ArgumentError => e
          Rails.logger.warn("Ignoring error converting #{detect[:encoding]} into UTF8: #{e.message}")

          ''
        end
      else
        clean(message)
      end
    rescue ArgumentError
      nil
    end

    # Returns a copy of str tagged as ASCII-8BIT (binary), or "" for nil.
    def encode_binary(str)
      return "" if str.nil?

      str.dup.force_encoding(Encoding::ASCII_8BIT)
    end

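    # Freezing str keeps StringIO#set_encoding from changing the encoding of the
    # caller's string to ASCII-8BIT, which breaks callers that expect the data
    # to still be UTF-8.
    def binary_stringio(str)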
      StringIO.new(str.freeze || '').tap { |io| io.set_encoding(Encoding::ASCII_8BIT) }
    end

    private

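    # Tags message as UTF-8 without transcoding, duplicating it first if frozen.
    # Raises ArgumentError for objects that cannot have their encoding forced.
    def force_encode_utf8(message)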
      raise ArgumentError unless message.respond_to?(:force_encoding)
      return message if message.encoding == Encoding::UTF_8 && message.valid_encoding?

      message = message.dup if message.respond_to?(:frozen?) && message.frozen?

      message.force_encoding("UTF-8")
    end

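    # Round-trips through UTF-16BE to drop invalid or undefined bytes, then
    # strips NUL characters from the UTF-8 result.
    def clean(message)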
      message.encode("UTF-16BE", undef: :replace, invalid: :replace, replace: "".encode("UTF-16BE"))
        .encode("UTF-8")
        .gsub("\0".encode("UTF-8"), "")
    end
  end
end
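
The truncated-diff case described in the threshold comment, and the UTF-16BE round trip that `clean` uses to drop bad bytes, can be reproduced with Ruby core alone. This is a minimal sketch, not the module's own code path: `scrub_via_utf16` is a hypothetical standalone copy of the private `clean` method, the sample strings are made up for illustration, and CharlockHolmes is deliberately left out.

```ruby
# Hypothetical standalone copy of the private `clean` helper, for illustration only.
# Round-trips through UTF-16BE, replacing invalid or undefined bytes with an
# empty string, then strips NUL characters from the UTF-8 result.
def scrub_via_utf16(message)
  message
    .encode("UTF-16BE", undef: :replace, invalid: :replace, replace: "".encode("UTF-16BE"))
    .encode("UTF-8")
    .gsub("\0".encode("UTF-8"), "")
end

# "é" is two bytes in UTF-8 (0xC3 0xA9). Keeping only the first byte mimics a
# diff that was cut off in the middle of a multibyte character.
truncated = "caf\xC3".dup.force_encoding("UTF-8")
puts truncated.valid_encoding? # the dangling 0xC3 byte makes the string invalid

# The dangling byte is dropped and the remaining text is valid UTF-8 again.
cleaned = scrub_via_utf16(truncated)
puts cleaned.valid_encoding?
puts cleaned.inspect

# NUL characters are stripped as well.
with_nul = scrub_via_utf16("a\0b")
puts with_nul.inspect
```

Dropping the single bad byte keeps the rest of the text intact, which is why a low-confidence guess from the detector is ignored in favour of treating the data as UTF-8.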
Commit history (from the blame view, oldest first):

2017-06-01 21:21:14 +00:00  Rename `Gitlab::Git::EncodingHelper` to `Gitlab::EncodingHelper`
2017-06-06 16:40:07 +00:00  Fix binary encoding error on MR diffs
2017-06-14 16:11:03 +00:00  Avoind unnecesary `force_encoding` operations
    They're costly. This will also avoid some edge cases where charlock_holmes assigns a weird encoding to a perfectly valid UTF-8 string.
2017-07-20 13:02:07 +00:00  Raise encoding confidence threshold to 50
    It is recommended that we set this to 50: https://gitlab.com/gitlab-org/gitlab-ce/issues/35098#note_35036746 In this particular issue, the confidence was 42 for Shift JIS, but in fact that's encoded in UTF-8 just with a single bad character. In this case, we shouldn't try to treat it as Shift JIS, but just treat it as UTF-8 and remove invalid bytes. Treating it like Shift JIS would corrupt the whole data. Unfortunately, the diff which would cause this could not be disclosed therefore we can't use it as a test example.
2017-09-03 11:45:44 +00:00  wip: fake its a binary diff
2017-09-04 17:34:15 +00:00  revert to using a simple representation
2017-09-04 19:32:57 +00:00  fix refactoring error with Blob.binary? remove some lint
2017-09-05 17:16:08 +00:00  renames ambiguous methods and add spec
2017-12-26 18:53:31 +00:00  Move encoding methods to the more general EncodingHelper
2018-01-04 22:27:37 +00:00  Fix a bug where charlock_holmes was used needlessly to encode strings
2018-02-09 17:58:29 +00:00  Return a warning string if we try to encode to unsupported encoding
    Fixes gitlab-development-kit#321
2018-03-22 20:22:20 +00:00  Fix EncodingHelper#clean blowing up on UTF-16BE strings
    Closes gitaly#1101
2018-07-02 10:43:06 +00:00  Updates from `rubocop -a`
2018-07-04 14:02:01 +00:00  Resolve Naming/UncommunicativeMethod
2018-08-29 16:38:17 +00:00  Fix Error 500s due to encoding issues when Wiki hooks fire
    Saved Wiki content goes through the GitalyClient::WikiService, which calls StringIO#set_encoding on the input stream. The problem is that this call mutates the encoding of the given string object to ASCII-88BIT, which causes problems for models expecting the data to still be in UTF-8. Freezing the input disables this behavior: https://github.com/ruby/ruby/blob/v2_4_4/ext/stringio/stringio.c#L1583 Closes #50590