-
Notifications
You must be signed in to change notification settings - Fork 54
Description
The diff algorithm does not always keep HTML entity references intact, which can be problematic when later loading the resulting diff as HTML (e.g. when diffing a list entry). This is best illustrated by the following test:
<options></options>
<oldText>
<ol><li>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non justo & sapien;</li></ol>
</oldText>
<newText>
<ol><li>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non sapien et justo;</li></ol>
</newText>
<expected>
<ol class="diff-list"><li class="normal">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non <ins class="diffins">sapien et </ins>justo<del class="diffdel"> & sapien</del>;</li></ol>
</expected>This test currently crashes with an error message DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 1 from ListDiffLines.php:409. This is because it tries to load the string
<body>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non <ins class="diffins">sapien et </ins>justo<del class="diffdel"> &</del>;<del class="diffdel"> sapien;</del></body>You can see that the diff algorithm broke up the & from its terminating ;, which leads to invalid HTML.
The solution might be to consider HTML entities a special case in the regex in AbstractDiff.php line 457 (e.g. adding an extra &[a-zA-Z0-9]+; case there). That fixes the given test without breaking the others, but I cannot oversee what further possible impact that may have.