Configuration
By default, only extraction from HTML documents is supported.
No configuration is required for that, unless you want to make
the content available through your DataObject subclass.
In this case, add the following to app/_config/config.yml:

SilverStripe\Assets\File:
  extensions:
    - SilverStripe\TextExtraction\Extension\FileTextExtractable

By default any extracted content will be cached against the database row. In order to stay within common size
constraints for SQL queries required in this operation, the cache sets a maximum character length after which
content gets truncated (default: 500000). You can configure this value through
Database.max_content_length in your YAML configuration.
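
As a rough sketch, lowering the truncation limit might look like the fragment below. It assumes the database cache class lives alongside the Cache class used elsewhere in this section, i.e. SilverStripe\TextExtraction\Cache\FileTextCache\Database; that fully qualified name, the config block name and the value 100000 are illustrative assumptions, so check them against your installed version.

```yaml
---
Name: mytextextraction-dbcache # hypothetical block name
After: '#textextraction'
---
# Assumed fully qualified name of the database-backed cache class
SilverStripe\TextExtraction\Cache\FileTextCache\Database:
  max_content_length: 100000 # illustrative value; the default is 500000
```

A smaller limit keeps individual rows and queries smaller, at the cost of truncating more of each long document.
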
Alternatively, extracted content can be cached using Cache to prevent excessive database growth.
To swap out the cache backend, use the following YAML configuration:

---
Name: mytextextraction
After: '#textextraction'
---
SilverStripe\Core\Injector\Injector:
  SilverStripe\TextExtraction\Cache\FileTextCache:
    class: SilverStripe\TextExtraction\Cache\FileTextCache\Cache

SilverStripe\TextExtraction\Cache\FileTextCache\Cache:
  lifetime: 3600 # Number of seconds to cache content for

XPDF
PDFs require special handling, for example through the XPDF
command-line utility. Follow the XPDF installation instructions; the binary is detected automatically
on *nix operating systems. You can optionally set the binary path (required on Windows) in app/_config/config.yml:

SilverStripe\TextExtraction\Extractor\PDFTextExtractor:
  binary_location: /my/path/pdftotext
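
On Windows, where the path has to be set explicitly, the same setting could look something like the sketch below. The install location is purely hypothetical, and whether the value should include the .exe suffix is an assumption to verify locally. Single quotes keep the backslashes literal in YAML.

```yaml
SilverStripe\TextExtraction\Extractor\PDFTextExtractor:
  # Hypothetical install location; point this at your actual pdftotext binary
  binary_location: 'C:\xpdf\bin64\pdftotext.exe'
```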