MCP-PDF-Extractor-server

by RayenMalouche · 0 · Score 33

Java-based MCP server using Apache Tika to extract content and metadata from PDFs, DOCX and other documents.

Tags: file-system, developer-tools, ai-llm

Forks: 0
Open issues: 0
Last commit: 9 mo ago
Indexed: 2d ago

Overview

The Tika MCP Extractor Server is a Java implementation that provides MCP-compliant tools for document extraction. It supports multiple formats, including PDF, DOCX, TXT, HTML and images, and converts their content to HTML with embedded CSS or to plain text. The server exposes four tools: extract-to-html, extract-text, list-available-files, and get-file-metadata, with error handling and logging throughout. Built with Spring Boot and Jetty, it offers both an MCP interface and REST endpoints for testing and integration.

Try asking AI

After installing, here are 6 use cases and questions you can bring to your AI assistant:

you: Processing and extracting content from local documents in secure environments without internet access
you: Integrating document extraction capabilities into MCP-enabled AI assistants like Claude Desktop
you: Providing a REST API for web applications to serve styled HTML content from document files
you: What file formats are supported?
you: Can I use this server without internet access?
you: How do I add custom Tika configurations?

When to choose this

Choose this server for local document processing workflows where you need to extract content and metadata without exposing documents to external services.

When NOT to choose this

Avoid this server if you need cloud-based processing or if your existing infrastructure is built in another language such as Python.

Tools this server exposes

4 tools extracted from the README
  • extract-to-html

    Converts file content to HTML with embedded CSS styling

  • extract-text

    Extracts plain text content from files

  • list-available-files

    Lists files in the extraction directory with details

  • get-file-metadata

    Retrieves detailed metadata from files like title, author, creation date
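
To make the tool list concrete, here is a minimal sketch of the JSON-RPC 2.0 `tools/call` message an MCP client would send to invoke `extract-text`. The message envelope follows the MCP specification. The argument name `filename` and the value `report.pdf` are assumptions for illustration only, since the tool's actual input schema is not shown on this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class ToolCallSketch {

    /** Wraps a value in double quotes, escaping backslashes and quotes (control characters are not handled). */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Builds a JSON-RPC 2.0 "tools/call" request whose arguments are all strings. */
    static String buildToolCall(int id, String tool, Map<String, String> args) {
        String argsJson = args.entrySet().stream()
                .map(e -> quote(e.getKey()) + ":" + quote(e.getValue()))
                .collect(Collectors.joining(",", "{", "}"));
        return "{\"jsonrpc\":\"2.0\",\"id\":" + id
                + ",\"method\":\"tools/call\",\"params\":{\"name\":" + quote(tool)
                + ",\"arguments\":" + argsJson + "}}";
    }

    public static void main(String[] args) {
        Map<String, String> toolArgs = new LinkedHashMap<>();
        // Hypothetical parameter name and file; check the server's tool schema for the real ones.
        toolArgs.put("filename", "report.pdf");
        System.out.println(buildToolCall(1, "extract-text", toolArgs));
    }
}
```

In practice an MCP client such as Claude Desktop builds this message for you; the sketch is only meant to show what crosses the wire when a tool is invoked.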

Comparable tools

file-mcp · document-extractor-server · mcp-server-tika

Installation

  1. **Prerequisites**:

- Java 23+
- Maven 3.6+

  2. **Clone and Setup**:

```bash
git clone https://github.com/RayenMalouche/MCP-PDF-Extractor-server.git
cd MCP-PDF-Extractor-server
mkdir files-to-extract
mvn clean install
```

  3. **Configure**:

Edit `src/main/resources/application.properties` if needed.

  4. **Run**:

```bash
# HTTP/SSE mode
mvn spring-boot:run

# STDIO mode
mvn spring-boot:run -- --stdio
```

  5. **Configure Claude Desktop** (for MCP usage):

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "tika-extractor": {
      "command": "java",
      "args": ["-jar", "path/to/your/target/TikaExtractorMCPServer-1.0.0.jar", "--stdio"]
    }
  }
}
```
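
The Claude Desktop entry amounts to launching the JAR as a child process and exchanging messages over its standard streams. As a rough sketch, the same launch could be prepared from Java as shown here. The class and method names are hypothetical, and the JAR path is the unresolved placeholder from the config, so substitute your real build output.

```java
import java.util.List;

final class StdioLaunchSketch {

    /** Assembles the same command line that the Claude Desktop entry describes. */
    static List<String> serverCommand(String jarPath) {
        return List.of("java", "-jar", jarPath, "--stdio");
    }

    /** Prepares (but does not start) the child process; stderr is passed through so server logs stay visible. */
    static ProcessBuilder prepare(String jarPath) {
        ProcessBuilder pb = new ProcessBuilder(serverCommand(jarPath));
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        return pb;
    }

    public static void main(String[] args) {
        // Placeholder path copied from the config above; replace it with the real JAR location.
        ProcessBuilder pb = prepare("path/to/your/target/TikaExtractorMCPServer-1.0.0.jar");
        System.out.println(String.join(" ", pb.command()));
        // Calling pb.start() would launch the server. A client would then write
        // newline-delimited JSON-RPC messages to the process's stdin and read replies from its stdout.
    }
}
```

The sketch stops short of `start()` on purpose, so it can be run before the JAR has been built.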

FAQ

What file formats are supported?
The server supports PDF, DOCX, TXT, HTML, images and many other formats through Apache Tika's comprehensive type detection system.
Can I use this server without internet access?
Yes, all operations are local and don't require internet access, making it suitable for secure document processing workflows.
How do I add custom Tika configurations?
You can modify Tika settings in the `application.properties` file or extend the `ConfigLoader` class for more complex customizations.


Last updated · Auto-generated from public README + GitHub signals.