dtSearch File Converters

dtSearch products embed dtSearch's own document filters and converters covering a wide range of popular file types. Files are converted to XML or HTML for browser-based display with highlighted hits.

The Document Filters & Converters can also be licensed for integration into other products via the dtSearch Engine File Converter API.

File Converter API

The data to convert can consist of a binary document file, such as a Word document, plus any number of field-value pairs; additional text can also be included in the converted output. The binary document file can be provided either as an accessible disk file or in a memory buffer. By default, the File Converter converts the input file to HTML. Other supported output formats are RTF, UTF-8 (Unicode text), ANSI, and XML (for XML input data only). The output can be a string object or a disk file.
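The options described above can be summarized in a small data model. The following Python sketch is illustrative only: `ConversionJob`, `OutputFormat`, and every field name are hypothetical and are not part of the dtSearch Engine API. It simply encodes the rules stated in the text (one binary source at most, HTML as the default, XML output for XML input only, output to a string or a disk file).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class OutputFormat(Enum):
    HTML = "html"   # the default output format
    RTF = "rtf"
    UTF8 = "utf8"   # Unicode text
    ANSI = "ansi"
    XML = "xml"     # allowed only when the input data is XML


@dataclass
class ConversionJob:
    """Hypothetical description of one conversion request (not the dtSearch API)."""
    input_file: Optional[str] = None        # binary document as an accessible disk file
    input_bytes: Optional[bytes] = None     # or the same document in a memory buffer
    fields: Dict[str, str] = field(default_factory=dict)  # any number of field-value pairs
    extra_text: str = ""                    # additional text to include in the output
    output_format: OutputFormat = OutputFormat.HTML
    output_file: Optional[str] = None       # None means "return the result as a string"
    input_is_xml: bool = False

    def validate(self) -> None:
        """Raise ValueError if the request contradicts the rules described above."""
        if self.input_file is not None and self.input_bytes is not None:
            raise ValueError("provide a disk file or a memory buffer, not both")
        has_input = (self.input_file is not None or self.input_bytes is not None
                     or bool(self.fields) or bool(self.extra_text))
        if not has_input:
            raise ValueError("nothing to convert")
        if self.output_format is OutputFormat.XML and not self.input_is_xml:
            raise ValueError("XML output is available for XML input data only")


# Example: a Word document held in memory, converted to the default HTML string.
job = ConversionJob(input_bytes=b"...document bytes...", fields={"Author": "J. Smith"})
job.validate()
```

A real integration would pass the equivalent settings to the licensed converter; the point of the sketch is only which combinations of input, format, and destination the description allows.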

Document Filters and Supported Data Types

dtSearch’s proprietary document filters support a broad spectrum of data formats. Support includes both data parsing and data conversion to HTML for browser display with highlighted hits. Supported data types include:

  • Web-ready static data: support includes integrated image and text support in HTML, XML/XSL and PDF.

  • MS Office documents: support covers integrated image and text support in Word (DOC/DOCX), PowerPoint (PPT/PPTX), Excel (XLS/XLSX) and Access (MDB/ACCDB).

  • Other “Office” documents and compression formats: support covers PDF including integrated image and text support, RTF, OpenOffice, ZIP, RAR, GZIP/TAR, etc.

  • Emails and email attachments: support covers MS Exchange, Outlook, Thunderbird, and other popular email types, with integrated image and text support in Thunderbird (MBOX/EML) and Outlook (PST/MSG) files; support also covers the full text of attachments, including nested attachments.

  • Recursively embedded files: support covers embedded versions of the above supported document types; for example, the document filters would provide support for an image embedded in a PowerPoint document that is itself embedded in a Word document and attached as a ZIP file to an email message.

  • Web-based dynamic data: support (through the dtSearch Spider) covers integrated image and text support in PHP, ASP.NET, SharePoint, etc.

  • Other databases: support covers SQL-type databases (through the dtSearch Engine APIs), along with the full text of BLOB data, as well as Access, XBASE, XML, CSV, etc.

  • Document Filter APIs: the APIs (covering Windows and Linux, native 64-bit and 32-bit, and C++, Java and .NET through current versions) provide parsing for all of the above data types, as well as conversion to HTML for browser display with highlighted hits. An “object extraction” API lets developers navigate through the structure of each embedded object as a hierarchy, and optionally extract each object, such as an image in an MS Word file embedded in an MS Access document, compressed and attached to an email.
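The idea of walking recursively embedded objects as a hierarchy can be illustrated with a short, generic sketch. This is not the dtSearch object extraction API: it uses only Python's standard `zipfile` module, handles ZIP containers only, and the names `walk_embedded` and `make_zip` are hypothetical helpers invented for the example. Because DOCX, PPTX and XLSX files are themselves ZIP containers, the same walk would also descend into those.

```python
import io
import zipfile
from typing import Dict, Iterator, Tuple


def walk_embedded(data: bytes, path: str = "") -> Iterator[Tuple[str, int]]:
    """Yield (hierarchical path, size in bytes) for each leaf item,
    descending into any member that is itself a ZIP container."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            item_path = f"{path}/{info.filename}" if path else info.filename
            payload = archive.read(info)
            if zipfile.is_zipfile(io.BytesIO(payload)):
                # Nested container: recurse and extend the path.
                yield from walk_embedded(payload, item_path)
            else:
                yield item_path, len(payload)


def make_zip(members: Dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP from a name-to-content mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()


# Demo: a ZIP that contains a text file and another ZIP.
inner = make_zip({"slide.txt": b"hello"})
outer = make_zip({"note.txt": b"hi", "deck.zip": inner})
for item_path, size in walk_embedded(outer):
    print(item_path, size)
```

Each yielded path records where an item sits in the container hierarchy (for example `deck.zip/slide.txt`), which is the kind of structure an object extraction interface would let a developer navigate. A production filter would also need to recognize non-ZIP containers such as email stores and OLE documents, which this sketch does not attempt.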