Excludes a few filters that require more work:
* lib/banzai/filter/image_lazy_load_filter_spec.rb
* lib/banzai/filter/syntax_highlight_filter_spec.rb
* lib/banzai/filter/table_of_contents_filter_spec.rb
Part of #47424
We displayed the correct text as the link text (without double-encoding), but
didn't do the same for the actual link target, so any link containing an
ampersand would break when auto-linked.
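The double-encoding problem described above can be illustrated with a minimal sketch. It uses the standard library's `ERB::Util.html_escape` as a stand-in for whatever escaping the filter pipeline applies, and the URL and the `decode_once` helper are hypothetical, chosen only for illustration:

```ruby
require 'erb'

# Hypothetical URL containing an ampersand in its query string.
url = 'http://example.com/search?a=1&b=2'

# Escaping once yields a valid HTML attribute value.
once = ERB::Util.html_escape(url)

# Escaping the already-escaped value encodes the ampersand a second time.
twice = ERB::Util.html_escape(once)

# Simplified model of a browser, which decodes entities in an href only once.
decode_once = ->(s) { s.gsub('&amp;', '&') }

puts once
puts twice
puts decode_once.call(once) == url   # single-encoded target resolves correctly
puts decode_once.call(twice) == url  # double-encoded target does not
```

After one round of decoding, the double-encoded value still contains a literal `&amp;`, so the link points at a different target than the one the author wrote.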
Rinku 2.0.0 (the version we use) will remove the last character of a link if
it's a closing part of a punctuation pair (different types of parentheses and
quotes), unless both of the below are true:
1. The matching pair has different start and end characters.
2. There are equal numbers of both in the matched string (they don't have to be
   balanced).
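The rule above can be modelled with a short sketch. This is not Rinku's code: the `PAIRS` table and the `trailing_char_dropped?` helper are hypothetical, and the exact set of punctuation pairs Rinku recognises is an assumption here.

```ruby
# Hypothetical table mapping each closing character to its opening character.
# Quotes use the same character for both ends.
PAIRS = {
  ')' => '(',
  ']' => '[',
  '}' => '{',
  '"' => '"',
  "'" => "'"
}.freeze

# Returns true if, under the rule described above, the last character of the
# matched link would be removed.
def trailing_char_dropped?(match)
  closer = match[-1]
  opener = PAIRS[closer]
  return false if opener.nil? # not a closing punctuation character

  distinct_ends = opener != closer
  equal_counts = match.each_char.count(opener) == match.each_char.count(closer)

  # The character is kept only when both conditions hold.
  !(distinct_ends && equal_counts)
end

puts trailing_char_dropped?('http://example.com/foo)')   # unmatched ")"
puts trailing_char_dropped?('http://example.com/(foo)')  # one "(" and one ")"
puts trailing_char_dropped?('http://example.com/foo"')   # quote: same start and end
```

Under this model a trailing quote is always dropped, because a quote pair can never satisfy the first condition.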
By using clever XPath queries we can significantly improve the
performance of this method. The actual improvement depends somewhat on the
number of links used, but in my tests the new implementation is usually
around 8 times faster than the old one. This was measured using the
following benchmark:

    require 'benchmark/ips'

    text = '<p>' + Note.select("string_agg(note, '') AS note").limit(50).take[:note] + '</p>'
    document = Nokogiri::HTML.fragment(text)
    filter = Banzai::Filter::AutolinkFilter.new(document, autolink: true)

    puts "Input size: #{(text.bytesize.to_f / 1024 / 1024).round(2)} MB"

    filter.rinku_parse

    Benchmark.ips(time: 15) do |bench|
      bench.report 'text_parse' do
        filter.text_parse
      end

      bench.report 'text_parse_fast' do
        filter.text_parse_fast
      end

      bench.compare!
    end

Here the "text_parse_fast" method is the new implementation and
"text_parse" the old one. The input size was around 180 MB. Running this
benchmark outputs the following:

    Input size: 181.16 MB
    Calculating -------------------------------------
              text_parse     1.000  i/100ms
         text_parse_fast     9.000  i/100ms
    -------------------------------------------------
              text_parse     13.021  (±15.4%) i/s -    188.000
         text_parse_fast    112.741  (± 3.5%) i/s -      1.692k

    Comparison:
         text_parse_fast:      112.7 i/s
              text_parse:       13.0 i/s - 8.66x slower

Again, production timings may (and most likely will) vary depending
on the input being processed.
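As a rough illustration of why an XPath query can be faster, the sketch below builds a query that asks the XML engine to return only the text nodes worth inspecting, rather than visiting every node from Ruby. The commit does not show the actual queries, so the `IGNORED_ANCESTORS` list, the `linkable_text_query` helper, and the `"://"` heuristic are all assumptions made for illustration.

```ruby
# Hypothetical list of elements whose text should never be auto-linked.
IGNORED_ANCESTORS = %w[a code pre].freeze

# Builds an XPath expression selecting text nodes that are not nested inside
# any ignored element and that contain something resembling a URL scheme.
def linkable_text_query(ignored = IGNORED_ANCESTORS)
  ancestors = ignored.map { |tag| "ancestor::#{tag}" }.join(' or ')
  %(descendant-or-self::text()[not(#{ancestors}) and contains(., "://")])
end

query = linkable_text_query
puts query

# With Nokogiri loaded, the query could be passed to a parsed fragment, e.g.
#   document.xpath(query).each { |text_node| ... }
# so that only candidate text nodes are handed back to Ruby.
```

The filtering then happens inside the XPath engine, and the Ruby code only handles nodes that could actually contain a link.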