Version 5 (Supported)

This version of Silverstripe CMS is still supported though will not receive any additional features.

Text extraction

This module provides a framework for extracting text content from various file formats, such as PDFs and Office documents. The extracted content can be used programmatically or made available directly on your File objects.

Installation

composer require silverstripe/textextraction

GitHub repository

https://github.com/silverstripe/silverstripe-textextraction

Configuration

Configuration options, including enabling extraction for DataObjects, managing cached content length, swapping cache backends, and configuring PDF text extraction

Usage

Various methods for text extraction, including extraction via file path or File object, and using the FileTextExtractable extension

Apache Solr

Apache Solr's role in text extraction using Apache Tika, its configuration, and content indexing

Tika

Using Apache Tika for text extraction, through either its CLI or a REST server