Configuration

By default, only extraction from HTML documents is supported. This needs no configuration unless you want to make the extracted content available through your DataObject subclass. In that case, add the following to app/_config/config.yml:

SilverStripe\Assets\File:
  extensions:
    - SilverStripe\TextExtraction\Extension\FileTextExtractable

By default, any extracted content is cached against the database row. To stay within common size constraints for the SQL queries this requires, the cache truncates content beyond a maximum character length (default: 500000). You can configure this value through Database.max_content_length in your YAML configuration.
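
As a minimal sketch of changing that limit, the fragment below lowers it to 100000 characters. It assumes the database cache class is SilverStripe\TextExtraction\Cache\FileTextCache\Database, that is, that it sits in the same namespace as the Cache class used elsewhere on this page. The value itself is only an illustration; check the class name against your installed module version before relying on it.

```yaml
# app/_config/config.yml
# Assumption: fully qualified name of the database-backed cache class
SilverStripe\TextExtraction\Cache\FileTextCache\Database:
  max_content_length: 100000 # illustrative value; content beyond this length is truncated
```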

Alternatively, extracted content can be cached using Cache to prevent excessive database growth. To swap out the cache backend, use the following YAML configuration:

---
Name: mytextextraction
After: '#textextraction'
---
SilverStripe\Core\Injector\Injector:
  SilverStripe\TextExtraction\Cache\FileTextCache:
    class: SilverStripe\TextExtraction\Cache\FileTextCache\Cache

SilverStripe\TextExtraction\Cache\FileTextCache\Cache:
  lifetime: 3600 # Number of seconds to cache content for

XPDF

PDFs require special handling, for example through the XPDF command-line utility. Follow its installation instructions; on *nix operating systems its presence is detected automatically. You can optionally set the binary path (required on Windows) in app/_config/config.yml:

SilverStripe\TextExtraction\Extractor\PDFTextExtractor:
  binary_location: /my/path/pdftotext
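
On Windows, where the binary path has to be set explicitly, the same setting might look like the sketch below. The install directory is hypothetical; substitute the location where pdftotext.exe actually lives on your machine. The value is single-quoted so that YAML treats the backslashes literally.

```yaml
# app/_config/config.yml
SilverStripe\TextExtraction\Extractor\PDFTextExtractor:
  # Hypothetical install path, shown for illustration only
  binary_location: 'C:\xpdf\bin64\pdftotext.exe'
```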