Text extraction
This module provides a framework for extracting text content from various file formats, such as PDFs and Office documents. The extracted content can be used programmatically or made available directly on your File objects.
Installation
composer require silverstripe/textextraction
GitHub repository
https://github.com/silverstripe/silverstripe-textextraction
Configuration
Configuration options: enabling extraction for DataObjects, managing cached content length, swapping cache backends, and configuring PDF text extraction
Usage
Various methods for text extraction, including extraction via file path or File object, and using the FileTextExtractable extension
Tika
Using Apache Tika for text extraction, in either a CLI or a REST server configuration