The SanitizationFilter was running before the WikiFilter. Since
WikiFilter can modify links, links that _should_ have been stripped by
SanitizationFilter could end up rendered on the page. I (kerrizor) had
previously addressed the bug in: 7bc971915b
However, an additional exploit was discovered after that was merged.
Working through the issue, we couldn't simply shuffle the order of the
filters, because several of them make implicit assumptions about that
order. Instead, we've extracted the logic that sanitizes a
Nokogiri-generated Node object and applied it to the WikiLinkFilter as
well.
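
A minimal sketch of the idea, assuming a hypothetical SanitizesNode helper
and the sanitize gem's Sanitize.node! API (the actual module name and
allowlist in the codebase differ):

require 'sanitize'

# Hypothetical shared helper: sanitize a single Nokogiri node in place,
# using the same kind of allowlist the pipeline-wide SanitizationFilter
# applies to the whole document.
module SanitizesNode
  def sanitize_node!(node, config: Sanitize::Config::RELAXED)
    Sanitize.node!(node, config)
  end
end

class WikiLinkFilter < HTML::Pipeline::Filter
  include SanitizesNode

  def call
    doc.search('a').each do |link|
      rewrite_link(link)   # filter-specific link rewriting
      sanitize_node!(link) # re-sanitize the node we just modified
    end

    doc
  end

  private

  def rewrite_link(link)
    # prepend the wiki base path, resolve relative hrefs, etc.
  end
end
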
On moving filters around:
Once we start moving filters around, we get cascading failures; fix
one, and another crops up. Many of the existing filters in the
WikiPipeline chain assume that other filters have already done their
work, and thus operate on a "transform anything that's left" basis.
WikiFilter, for instance, assumes any link it finds in the markdown
should be prepended with the wiki_base_path, but if it does that, it
also turns `href="@user"` into `href="/path/to/wiki/@user"`, which the
UserReferenceFilter no longer recognizes as a user reference that needs
to be transformed into a user profile link. The same applies to all the
reference filters in the WikiPipeline.
Excludes a few filters that require more work:
* lib/banzai/filter/image_lazy_load_filter_spec.rb
* lib/banzai/filter/syntax_highlight_filter_spec.rb
* lib/banzai/filter/table_of_contents_filter_spec.rb
Part of #47424
We displayed the correct text as the link text (without double-encoding), but
didn't do the same for the actual link target, so any link containing an
ampersand would break when auto-linked.
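
A minimal sketch of the fix, assuming the matched URL arrives already
HTML-escaped (hypothetical helper; the real filter's plumbing differs):

require 'cgi'

# The match taken from the rendered document has '&' encoded as '&amp;'.
# Decode it once for the href so the serializer re-encodes it exactly once,
# and keep the already-escaped match as the visible link text.
def build_link(escaped_match)
  target = CGI.unescapeHTML(escaped_match) # e.g. "http://example.com?a=1&b=2"

  %(<a href="#{CGI.escapeHTML(target)}">#{escaped_match}</a>)
end

build_link('http://example.com?a=1&amp;b=2')
# => <a href="http://example.com?a=1&amp;b=2">http://example.com?a=1&amp;b=2</a>
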
Rinku 2.0.0 (the version we use) will remove the last character of a link if
it's a closing part of a punctuation pair (different types of parentheses and
quotes), unless both of the below are true:
1. The matching pair has different start and end characters.
2. There are equal numbers of both in the matched string (they don't have to be
balanced).
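
For illustration, assuming the rules above (example outputs are not
verified against Rinku 2.0.0 itself):

require 'rinku'

# '(' and ')' differ and appear an equal number of times in the match,
# so the trailing ')' is kept as part of the link.
Rinku.auto_link('See http://example.com/foo_(bar)')

# Only a closing ')' appears, so Rinku trims it off the end of the link
# and it stays as plain text after the <a> tag.
Rinku.auto_link('See http://example.com/foo)')
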
By using clever XPath queries we can significantly improve the
performance of this method. The actual improvement depends on the
number of links present, but in my tests the new implementation is
usually around 8 times faster than the old one (a sketch of the XPath
approach is included at the end of this message). This was measured
using the following benchmark:
require 'benchmark/ips'

text = '<p>' + Note.select("string_agg(note, '') AS note").limit(50).take[:note] + '</p>'
document = Nokogiri::HTML.fragment(text)
filter = Banzai::Filter::AutolinkFilter.new(document, autolink: true)

puts "Input size: #{(text.bytesize.to_f / 1024 / 1024).round(2)} MB"

filter.rinku_parse

Benchmark.ips(time: 15) do |bench|
  bench.report 'text_parse' do
    filter.text_parse
  end

  bench.report 'text_parse_fast' do
    filter.text_parse_fast
  end

  bench.compare!
end
Here the "text_parse_fast" method is the new implementation and
"text_parse" the old one. The input size was around 180 MB. Running this
benchmark outputs the following:
Input size: 181.16 MB

Calculating -------------------------------------
          text_parse     1.000  i/100ms
     text_parse_fast     9.000  i/100ms
-------------------------------------------------
          text_parse     13.021 (±15.4%) i/s -    188.000
     text_parse_fast    112.741 (± 3.5%) i/s -     1.692k

Comparison:
     text_parse_fast:      112.7 i/s
          text_parse:       13.0 i/s - 8.66x slower
Again, the production timings may (and most likely will) vary depending
on the input being processed.
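
For reference, here is a rough sketch of the XPath-based approach used by
"text_parse_fast", assuming we only care about text nodes containing "://"
that are not already inside a link (the real filter handles more cases,
and autolink_filter here stands in for the actual match-wrapping logic):

# Ask Nokogiri/libxml up front for only the text nodes that could contain
# a link, instead of visiting every text node in the document.
TEXT_QUERY = 'descendant-or-self::text()[not(ancestor::a) and contains(., "://")]'.freeze

def text_parse_fast
  doc.xpath(TEXT_QUERY).each do |node|
    content = node.to_html

    next unless content =~ %r{([a-z][a-z0-9\+\.-]+://\S+)}i

    html = autolink_filter(content) # hypothetical: wraps matches in <a> tags
    next if html == content

    node.replace(html)
  end

  doc
end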