Line and Column Reporting

Content

A recently reported 'bug' in HTML Tidy (2009-06-24, reference http://tidy.sf.net/issue/2811690, "Incorrect line/column returned for Text nodes") caused me to take a closer look at the line and column reporting, and it is more complicated than I first imagined. I tried working with an extended Test1.html, given below, but ran into immediate problems...

The question is what to do about the 'implicit' nodes ADDED by Tidy. In the Test1.html below there are MANY added nodes... These nodes have NO real file line or column position, since they do not exist in the file!

I constructed a quick tidynodes.exe application, which only loads the HTML file into the node tree and then walks it with a dumpNode function, like the one given in tidy.h, extended to print the line and column. BUT I immediately found a problem with, say, the <head> and <title> IMPLICIT nodes added by Tidy...
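The walk itself is simple. As a rough sketch only, with a hypothetical DemoNode struct standing in for Tidy's real node tree (the actual tool used the tidy.h API, not this struct, and all names here are made up for illustration), a depth-first dump that numbers each node and flags the implicit ones might look like this:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for a Tidy node; NOT the real libtidy structure. */
typedef struct DemoNode {
    const char *name;        /* element name, or "Text", "Comment", ... */
    int line, column;        /* 1-based position held by the node */
    int implicit;            /* non-zero if the parser added the node itself */
    struct DemoNode *child;  /* first child, or NULL */
    struct DemoNode *next;   /* next sibling, or NULL */
} DemoNode;

typedef struct DumpStats {
    int total;               /* all nodes visited */
    int implicit;            /* of which implicit */
} DumpStats;

/* Depth-first, pre-order walk: one output row per node, counts kept in stats. */
static void demo_dump(const DemoNode *node, DumpStats *stats, FILE *out)
{
    for (; node != NULL; node = node->next) {
        stats->total++;
        if (node->implicit) {
            stats->implicit++;
            fprintf(out, "%2d: implicit: Node: %s (%d,%d)\n",
                    stats->total, node->name, node->line, node->column);
        } else {
            fprintf(out, "%2d: %d: %d: Node: %s\n",
                    stats->total, node->line, node->column, node->name);
        }
        demo_dump(node->child, stats, out);
    }
}
```

Calling demo_dump(root, &stats, stdout) with a zeroed DumpStats prints the rows and leaves the node counts in stats.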

The output from my tidynodes.exe was as follows. The columns are -
Num, Line, Column, Node Type, [File text], T[Lexer text], if 'text' node -

Input File: [C:\Projects\tidy\tests\2811690-04.html],
Size: [799] bytes, Time: [Fri Jul 31 17:29:48 2009].
Dump from ROOT... 55 nodes... 6 implicit
 1: 1: 1: Node: DOCTYPE [<!DOCTYPE HTML PUBLIC "-//W3C//D]ok
 2: 2: 1: Node: html [<html>]ok
 3: implicit: Node: head (3,1)
 4: implicit: Node: title (41,1)
 5: 3: 1: Node: body [<body>]ok
 6: 4: 1: Node: b [<b>]ok
 7: 4: 4: Node: Text [This]T[This]ok
 8: 4: 12: Node: Text [ is&#8623;another paragraph..]T[ is↯another paragraph]ok1
 9: 6: 1: Node: div [<div style="font-size: 2em">]ok
10: implicit: Node: i (7,1)
11: 7: 1: Node: Text [This is a ]T[This is a ]ok
12: 7: 11: Node: b [<b>]ok
13: 7: 14: Node: Text [paragraph]T[paragraph]ok
14: 9: 1: Node: p [<p>]ok
15: implicit: Node: i (9,4)
16: 9: 4: Node: Text [ This is&nbsp;another paragraph.]T[This is another paragraph]Different? But
AT: 9, 5 found match to lexer text (CT=SKIP_ENTITY)
17: 10: 1: Node: table [<table>]ok
18: 11: 1: Node: caption [<caption>]ok
19: 11: 10: Node: Text [ And this is the table!..]T[And this is the table!]Different? But
AT: 11, 11 found match to lexer text
20: 12: 1: Node: tfoot [<tfoot>]ok
21: implicit: Node: tr (13,1)
22: 13: 1: Node: td [<td>]ok
23: 13: 5: Node: Text [Cell F1]T[Cell F1]ok
24: 13: 12: Node: td [<td>]ok
25: 13: 16: Node: Text [Cell F2]T[Cell F2]ok
26: 15: 1: Node: td [<td>]ok
27: 15: 5: Node: Text [Cell F3..]T[Cell F3]ok
28: 18: 1: Node: tbody [<tbody>]ok
29: 19: 1: Node: tr [<tr>]ok
30: 20: 1: Node: td [<td>]ok
31: 20: 5: Node: Text [Cell B1]T[Cell B1]ok
32: implicit: Node: tr (21,1)
33: 21: 1: Node: td [<td>]ok
34: 21: 5: Node: Text [Cell B2]T[Cell B2]ok
35: 25: 1: Node: p [<p>]ok
36: 25: 4: Node: Text [Added TAB tests]T[Added TAB tests]ok
37: 26: 1: Node: p [<p>]ok
38: 26: 4: Node: Text [1.]T[1 ]ok1
39: 26: 6: Node: b [<b>]ok
40: 26: 9: Node: Text [TAB]T[TAB]ok
41: 27: 1: Node: p [<p>]ok
42: 27: 4: Node: Text [2..]T[2 ]ok1
43: 27: 7: Node: b [<b>]ok
44: 27: 10: Node: Text [TAB]T[TAB]ok
45: 28: 1: Node: p [<p>]ok
46: 28: 4: Node: Text [At column 3, add other tests...]T[At column 3, add other tests...]ok
47: 29: 3: Node: PHP [<?php echo "this is php" ?>]ok
48: 30: 3: Node: ASP [<% asp_function() %>]ok
49: 31: 3: Node: JSTE [<# java_function() #>]ok
50: 32: 3: Node: Processing Instruction [<? proc_func() >]ok
51: 33: 26: Node: Comment [-- adj hyphen warning -->]Different? But
AT: 33, 3 found match to lexer text (CT=HYPHEN_EQUAL)
52: 34: 3: Node: XML Section [<![section]>]ok
53: 35: 3: Node: CDATA [<![CDATA[ ..some data ..]]>]Different? But diff only spacey chars!
54: 38: 41: Node: p [<p>]ok
55: 38: 44: Node: Text [eof]T[eof]ok
End dump of 49 non-implicit nodes... 46 ok, 3 found, none missed.

Note that I have marked the 'implicit' nodes simply as 'implicit', since they in fact have NO real file position: they are added by Tidy, so no file line or column applies, except perhaps the end of the previous node... The text in the first square brackets [] is the file text at the indicated location in the file!

For a 'TEXT' node, I used the tidyNodeGetValue(...) API to get the lexer text, and compared only its first character with the file text. If they are the same, the output is 'ok', which tends to indicate the line/column are correct; if not, it is 'Different?'... This is not a totally accurate or failsafe method, but it works in the main ;=)) Of course, it is always different if the file text begins with a space, since Tidy drops that space from the lexer text...
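The check described above can be sketched roughly as follows. This is not the tidynodes.exe source, only a minimal illustration: the function names are hypothetical, line endings are assumed to be '\n', and every byte (a tab included) is assumed to count as one column.

```c
#include <stddef.h>

/* Return the byte offset of a 1-based (line, column) position in buf,
 * or -1 if that position lies outside the buffer.
 * Assumes '\n' line endings and one column per byte. */
static long offset_of(const char *buf, size_t len, int line, int column)
{
    size_t i = 0;
    int cur = 1;

    if (line < 1 || column < 1)
        return -1;
    while (cur < line) {                 /* skip whole lines */
        while (i < len && buf[i] != '\n')
            i++;
        if (i >= len)
            return -1;                   /* ran out of lines */
        i++;                             /* step over the newline */
        cur++;
    }
    for (int c = 1; c < column; c++) {   /* advance within the line */
        if (i >= len || buf[i] == '\n')
            return -1;                   /* column is past the end of the line */
        i++;
    }
    return (i < len) ? (long)i : -1;
}

/* 1 if the file character at (line, column) equals the first character
 * of the node text, 0 otherwise (including any out-of-range position). */
static int first_char_matches(const char *buf, size_t len,
                              int line, int column, const char *node_text)
{
    long off = offset_of(buf, len, line, column);

    if (off < 0 || node_text == NULL || node_text[0] == '\0')
        return 0;
    return buf[off] == node_text[0];
}
```

A text node whose file text starts with a space, but whose lexer text has had that space dropped, fails this test even when the reported position is right, which is exactly the weakness noted above.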

As stated in the above 'bug':
"you'll see that for the text node "is↯another paragraph" the column value reported is 31 (where the second whitespace was suppressed) and not the expected value of 12".

And yes, my original output showed this error, before removing the SetLexerLocus() call :-

 4: 31:   Node: Text [paragraph..]T[ is↯another paragraph]Different?

should be 4: 12:, and NOT 4: 31:!

And YES, it does seem the call to SetLexerLocus( doc, lexer ); when in :-

  case LEX_CONTENT:  /* element content */
    if (TY_(IsWhite)(c)) { /* was previous character white? */
      if (lexer->waswhite) {
        if (mode != Preformatted && mode != IgnoreMarkup) {
          SetLexerLocus( doc, lexer );

does this.

When you think about it logically, WHY should a 'repeated' white space character continually call SetLexerLocus()???

At the beginning of this 'get-node', the lexer locus is correctly set to the beginning of the node, so why should each repeated space bump it further and further away from that beginning?
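To make the drift concrete, here is a toy model of the effect. It is NOT Tidy's lexer: the function and its flag are hypothetical, and it assumes that by the time the locus is reset the input stream has already moved one column past the repeated space.

```c
#include <ctype.h>

/* Toy model: return the column a text node would report.
 * line              one line of input, NUL-terminated
 * start_col         1-based column where the text node begins
 * reset_on_repeat   non-zero mimics resetting the locus on every repeated
 *                   whitespace character; zero mimics leaving it alone */
static int text_node_column(const char *line, int start_col, int reset_on_repeat)
{
    int locus = start_col;   /* set once, when the node starts */
    int was_white = 0;

    for (int col = start_col;
         line[col - 1] != '\0' && line[col - 1] != '<';
         col++) {
        int white = isspace((unsigned char)line[col - 1]) != 0;

        if (white && was_white && reset_on_repeat)
            locus = col + 1; /* assumed: stream is already past this char */
        was_white = white;
    }
    return locus;
}
```

For the line `<b>This</b> is&#8623;another  paragraph` from the test file, with the text node starting at column 12, this model returns 31 when reset_on_repeat is set and 12 when it is not, which matches the columns quoted from the bug report.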

Removing this lexer-locus call FIXES the line and column reported by the services tidyNodeLine(node); and tidyNodeColumn(node);, and my tidynodes.exe output is now correct for this test file -

 4: 12:   Node: Text [ is&#8623;anothe]T[ is↯another paragraph]ok1

But this does NOT deal with the 'implicit' nodes. These are of course not text nodes, but if we are correcting the line/column reporting of 'text' nodes, then it stands to reason that we should deal with the line/column reporting of non-text nodes as well.

The ideas I had on this were :-

(a) New Tidy service like :-
Bool tidyNodeImplicit( TidyNode tnod ); returning 'yes' or 'no'.
This is what I have done above, to allow me to avoid outputting line/column information when the node is 'implicit'...

(b) Modify tidyNodeLine(node); and tidyNodeColumn(node); so that each returns ZERO (0) when the node is 'implicit'. This would allow an API user to 'know' that when 0 is returned, there is no file location for this item, since it was added by Tidy...
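Both ideas can be sketched against a hypothetical PosNode struct. None of the demo_* functions below are libtidy calls; they only illustrate how a caller could use either an 'implicit' query, as in (a), or a zero line/column, as in (b), to avoid printing a position that does not exist in the file.

```c
#include <stdio.h>

/* Hypothetical node model; NOT the real libtidy structure. */
typedef struct PosNode {
    unsigned line;      /* 1-based line held by the node */
    unsigned column;    /* 1-based column held by the node */
    int implicit;       /* non-zero if the node was added by the parser */
} PosNode;

/* Idea (a): a direct query for 'implicit'. */
static int demo_node_implicit(const PosNode *n)
{
    return n != NULL && n->implicit != 0;
}

/* Idea (b): line and column are 0 when there is no file position. */
static unsigned demo_node_line(const PosNode *n)
{
    return (n == NULL || n->implicit) ? 0u : n->line;
}

static unsigned demo_node_column(const PosNode *n)
{
    return (n == NULL || n->implicit) ? 0u : n->column;
}

/* Caller side: print "line: column", or "implicit" when 0 is returned. */
static void demo_format_position(const PosNode *n, char *buf, size_t size)
{
    unsigned line = demo_node_line(n);

    if (line == 0)
        snprintf(buf, size, "implicit");
    else
        snprintf(buf, size, "%u: %u", line, demo_node_column(n));
}
```

Since real file positions are 1-based, 0 cannot collide with a valid line or column, so (b) needs no new service, while (a) keeps whatever position the node holds available to callers who want it.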

AND this 'fix', removing the SetLexerLocus( doc, lexer ); call, in conjunction with the previous 'fix' for TAB character expansion, also corrects the line and column even when there are tab characters in the file...


Downloads

*TBD*


=== tests/2811690-04.htm (was test.html) ===

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<body>
<b>This</b> is&#8623;another  paragraph
<i>
<div style="font-size: 2em">
This is a <b>paragraph</b>
</div>
<p> This is&nbsp;another paragraph
<table>
<caption> And this is the table!
<tfoot>
<td>Cell F1<td>Cell F2</td>
 
<td>Cell F3
</tr>
</tfoot>
<tbody>
<tr>
<td>Cell B1</td></tr>
<td>Cell B2</tr>
</tbody>
</table>
</i>
<p>Added TAB tests</p>
<p>1 <b>TAB</b></p>
<p>2  <b>TAB</b></p>
<p>At column 3, add other tests...</p>
  <?php echo "this is php" ?>
  <% asp_function() %>
  <# java_function() #>
  <? proc_func() >
  <!-- THis is a comment -- adj hyphen warning -->
  <![section]>
  <![CDATA[ 
some data 
]]>
                                        <p>eof</p>
</body>
</html>

=== End tests/2811690-04.htm (was test.html) ===
