Version 6 supported

Text extraction

This module provides a framework for extracting text content from various file formats, such as PDFs and Office documents. The extracted content can be used programmatically or made available directly on your File objects.

Installation

composer require silverstripe/textextraction

GitHub repository

https://github.com/silverstripe/silverstripe-textextraction

Configuration
Configuration options, including enabling extraction for DataObjects, managing cached content length, swapping cache backends, and configuring PDF text extraction
Usage
Methods for extracting text from a file path or a File object, and for using the FileTextExtractable extension
Tika
Using Apache Tika for text extraction, configured as either a CLI tool or a REST server