Using Tidy for Indexing HTML Pages
To index an HTML baggage file, Reverb uses Tidy (a tool for cleaning up HTML files) to create an XHTML copy of the file, which gives ePublisher a valid XML file it can read. As useful as Tidy is, there may be times when it does not recognize a tag or generates incorrect output. Tidy is configurable, so you can adjust it to convert the HTML correctly.
When Tidy does not recognize a tag in an HTML file, an error like the following is produced:
line 33 column 3 - Error: <not_recognized_tag> is not recognized!
This error means that Tidy could not generate an XHTML copy of the HTML file, so ePublisher cannot index it as a baggage file. You can fix this by adjusting the Tidy configuration.
Configuring Tidy To Recognize New Tags
1. Go to the Tidy directory under the installation directory on your local computer: ...\WebWorks\ePublisher\<VERSION>\Helpers\tidy\
2. Create a Format override of this helper: in your project's Formats sub-folder, where Format overrides live, create a new folder called Helpers and copy the entire tidy folder (from step 1) into it.
3. In the newly created tidy folder, open the config.txt file.
4. Depending on the kind of tag you want to add, uncomment line 8, line 10, or both in the config.txt file.
5. On each line you uncommented, replace the placeholder after the colon with your new tag name (for example: not_recognized_tag).
6. Save and close the file.
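As a rough illustration of steps 4 and 5, the edited lines might look like the sketch below. The option names new-blocklevel-tags and new-inline-tags are standard Tidy configuration options for declaring custom tags, but it is an assumption that lines 8 and 10 of the shipped config.txt use exactly these options, so check your own copy. The tag name my_inline_tag is hypothetical and used only for illustration.

```
# Sketch only: assumed form of the uncommented lines; verify against your config.txt.
# Declare a tag that behaves like a block element (similar to <div> or <p>):
new-blocklevel-tags: not_recognized_tag
# Declare a tag that behaves like an inline element (similar to <span> or <em>).
# my_inline_tag is a hypothetical name.
new-inline-tags: my_inline_tag
```

Only the line that matches the kind of tag needs to be uncommented. A tag that wraps paragraphs or other blocks belongs on the block-level line, and a tag that appears inside running text belongs on the inline line.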
To learn more about customizing Tidy, see https://www.w3.org/People/Raggett/tidy/.
Last modified date: 01/19/2022