Version 5 (supported)

This version of Silverstripe CMS is still supported, though it will not receive any additional features.

Text extraction

This module provides a framework for extracting text content from various file formats, such as PDFs and Office documents. The extracted content can be used programmatically or made available directly on your File objects.

Installation

bash
composer require silverstripe/textextraction
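
After running the command above, it can be useful to confirm that the requirement was actually recorded in the project. The following is a minimal sketch, not part of the module: `has_textextraction` is a hypothetical helper name, and it assumes the project's `composer.json` sits in the directory passed as the first argument (defaulting to the current directory).

```bash
#!/usr/bin/env bash

# Hypothetical helper (not provided by the module): returns success if the
# composer.json in the given directory mentions silverstripe/textextraction.
# It is a plain text match, so it does not distinguish "require" from
# "require-dev" and does not check which version is installed.
has_textextraction() {
    local project_dir="${1:-.}"
    grep -qF '"silverstripe/textextraction"' "${project_dir}/composer.json"
}

# Example usage from the project root:
#   has_textextraction . && echo "textextraction is listed in composer.json"
```

For the installed version rather than the recorded requirement, `composer show silverstripe/textextraction` prints the package details.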

GitHub repository

https://github.com/silverstripe/silverstripe-textextraction

Configuration
Configuration options, including enabling extraction for DataObjects, managing cached content length, swapping cache backends, and configuring PDF text extraction

Usage
Various methods for text extraction, including extraction via file path or File object, and using the FileTextExtractable extension

Apache Solr
Apache Solr's role in text extraction using Apache Tika, its configuration, and content indexing

Tika
Extracting text with Apache Tika, configured to run either via the CLI or as a REST server