TIKA-4578#2462
- Add Dockerfile with Ubuntu base, Java 17, OCR, and font support
- Add docker-build.sh script to build and optionally push images
- Add start-tika-grpc.sh entrypoint script
- Include all tika-pipes plugins (fetchers, emitters, iterators)
- Include parser packages (standard, extended, ML, scientific, sqlite3, NLP)
- Add README with usage instructions and examples
- Support multi-arch builds and multiple registries (Docker Hub, ECR, ACR)

- Add two exec-maven-plugin configurations matching reference pattern
- First plugin: runs TikaGrpcServer (exec:java)
- Second plugin: chmod and docker-build.sh executions
- Add validate phase execution to chmod +x the docker-build.sh script
- Add package phase execution to run prepare-docker-image
- Add skip.docker.build property (default: true) to control execution
- Update README with Maven integration instructions

- Pass MULTI_ARCH, AWS_REGION, AWS_ACCOUNT_ID, AZURE_REGISTRY_NAME, DOCKER_ID, PROJECT_NAME, RELEASE_IMAGE_TAG
- Update README with comprehensive Maven + env var examples
- Enable full control: MULTI_ARCH=false DOCKER_ID=ndipiazza PROJECT_NAME=tika-grpc RELEASE_IMAGE_TAG=4.0.0-SNAPSHOT mvn clean package -Dskip.docker.build=false
- Clean up duplicate examples in README
Pull request overview
This PR adds Docker image build capabilities to the tika-grpc module, enabling automated builds during the Maven package phase. The implementation supports pushing images to Docker Hub, AWS ECR, and Azure Container Registry, with optional multi-architecture build support.
Key changes:
- Maven integration via exec-maven-plugin to trigger Docker builds during package phase
- Shell scripts for Docker image building and container startup
- Profile-based activation using environment variables (DOCKER_ID, AWS_ACCOUNT_ID, AZURE_REGISTRY_NAME)
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| tika-grpc/pom.xml | Adds Maven profiles and exec-maven-plugin configuration to trigger Docker builds based on environment variables |
| tika-grpc/docker-build/docker-build.sh | Shell script that assembles Docker context, handles registry authentication, and executes docker build with appropriate tags |
| tika-grpc/docker-build/start-tika-grpc.sh | Container entrypoint script that configures and starts the Tika gRPC server with environment-based settings |
| tika-grpc/docker-build/Dockerfile | Defines Ubuntu-based image with Java, Tesseract OCR, GDAL, and font support for the Tika gRPC server |
| tika-grpc/docker-build/README.md | Comprehensive documentation covering build options, environment variables, and usage examples |
| tika-grpc/README.md | Updated main README with quick start documentation for Docker builds |
Comments suppressed due to low confidence (1)
tika-grpc/docker-build/docker-build.sh:95
- The buildx builder 'tikabuilder' is created but may already exist from a previous run, which will cause the 'docker buildx create' command to fail. Add the '--use' flag or check if the builder exists first, or use 'docker buildx create --name tikabuilder --driver docker-container --bootstrap || true' to avoid errors on re-runs.
docker buildx create --name tikabuilder
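The "check if the builder exists first" option from the comment above could look roughly like the sketch below. `ensure_builder` is a hypothetical helper name, not something taken from docker-build.sh; the sketch relies on `docker buildx inspect` exiting non-zero when the named builder is unknown.

```shell
# Sketch only: create the buildx builder on first use and reuse it afterwards.
# ensure_builder is a hypothetical helper, not part of docker-build.sh.
ensure_builder() {
  local name="$1"
  # `docker buildx inspect <name>` fails when no builder with that name exists.
  if docker buildx inspect "$name" >/dev/null 2>&1; then
    echo "Reusing existing buildx builder: $name"
  else
    echo "Creating buildx builder: $name"
    docker buildx create --name "$name" --driver docker-container --bootstrap >/dev/null
  fi
}
```

Calling `ensure_builder tikabuilder` before the multi-arch build would then be safe on re-runs, because the create step is skipped once the builder is present.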
… On Tue, Dec 16, 2025 at 1:44 PM Nicholas DiPiazza ***@***.***> wrote:
https://issues.apache.org/jira/browse/TIKA-4578
This adds the ability to run a build like this:
MULTI_ARCH=false \
DOCKER_ID=ndipiazza \
PROJECT_NAME=tika-grpc \
RELEASE_IMAGE_TAG=4.0.0-SNAPSHOT \
mvn package -DskipTests -f tika-grpc
View build details: docker-desktop://dashboard/build/desktop-linux/desktop-linux/s064ofjilbw5g09r82mtlb0mj
===================================================================================================
Done running docker build with tag -t ndipiazza/tika-grpc:4.0.0-SNAPSHOT
===================================================================================================
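As a rough illustration of how the variables in the command above map to the tag shown in the log, here is a minimal sketch. `image_refs` is a hypothetical helper and the fallback values are assumptions; the real docker-build.sh may compose its tags differently. The ECR and ACR host names follow the standard formats of those registries.

```shell
# Sketch only: print one image reference per configured registry.
# image_refs is a hypothetical helper; the defaults below are assumptions.
image_refs() {
  local project="${PROJECT_NAME:-tika-grpc}"   # assumed default
  local tag="${RELEASE_IMAGE_TAG:-latest}"     # assumed default
  # Docker Hub: <user>/<project>:<tag>
  if [ -n "${DOCKER_ID:-}" ]; then
    echo "${DOCKER_ID}/${project}:${tag}"
  fi
  # AWS ECR: <account>.dkr.ecr.<region>.amazonaws.com/<project>:<tag>
  if [ -n "${AWS_ACCOUNT_ID:-}" ] && [ -n "${AWS_REGION:-}" ]; then
    echo "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${project}:${tag}"
  fi
  # Azure Container Registry: <registry>.azurecr.io/<project>:<tag>
  if [ -n "${AZURE_REGISTRY_NAME:-}" ]; then
    echo "${AZURE_REGISTRY_NAME}.azurecr.io/${project}:${tag}"
  fi
}
```

With `DOCKER_ID=ndipiazza`, `PROJECT_NAME=tika-grpc`, and `RELEASE_IMAGE_TAG=4.0.0-SNAPSHOT` set and the cloud variables unset, `image_refs` prints `ndipiazza/tika-grpc:4.0.0-SNAPSHOT`, which matches the tag in the build output.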
Commit Summary
- 08df7e7 TIKA-4578: Add Docker build configuration for tika-grpc
- ffb8a42 TIKA-4578: Integrate Docker build into Maven lifecycle
- ffbe00a TIKA-4578: Pass all environment variables from Maven to docker-build.sh
- daea452 TIKA-4578 - Add profiles to enable Docker build for AWS, Azure, and Docker Hub
- 75d1208 TIKA-4578 - Add profiles to enable Docker build for AWS, Azure, and Docker Hub
File Changes (6 files <https://github.com/apache/tika/pull/2462/files>)
- *M* tika-grpc/README.md (56) <https://github.com/apache/tika/pull/2462/files#diff-8eb1dd37bbabb938fa86e44aae1037caea81daa9f0de2214bc515871424882f6>
- *A* tika-grpc/docker-build/Dockerfile (39) <https://github.com/apache/tika/pull/2462/files#diff-e19d5934b994905dcd957ddaf3e1948fa19dfd05e165d2a0440ae2bf08598593>
- *A* tika-grpc/docker-build/README.md (170) <https://github.com/apache/tika/pull/2462/files#diff-950b4c55c8551c3951b039259629270e3a6a4f41e0c6ab7198ebaaf2b4e36c87>
- *A* tika-grpc/docker-build/docker-build.sh (113) <https://github.com/apache/tika/pull/2462/files#diff-3665f4ba56a8416c1010c9658624c9c95e67c97672fa70ad321829e70722e217>
- *A* tika-grpc/docker-build/start-tika-grpc.sh (29) <https://github.com/apache/tika/pull/2462/files#diff-de3ec4b62dfe6e52a270160654a736246fcde8e488bb301d323006881844f476>
- *M* tika-grpc/pom.xml (83) <https://github.com/apache/tika/pull/2462/files#diff-8dd3f2e428e7f3f7d7fe38b6a26f0f494f2269b3215619da4ad5a6c9cdebb24b>
Patch Links:
- https://github.com/apache/tika/pull/2462.patch
- https://github.com/apache/tika/pull/2462.diff
Security and Robustness:
- Pin base image to ubuntu:22.04 instead of ubuntu:latest for reproducible builds
- Pin tonistiigi/binfmt to specific digest (sha256:8de6f2dec...) to prevent supply chain attacks
- Add Docker installation check at script start with clear error message
- Add error handling for AWS/Azure authentication failures
- Add buildx builder cleanup (rm) after multi-arch build

Code Quality:
- Remove hardcoded VERSION default in Dockerfile ARG
- Change start-tika-grpc.sh shebang from /bin/sh to /bin/bash for consistency
- Merge duplicate exec-maven-plugin declarations into single plugin config
- Replace chmod exec with maven-antrun-plugin for cross-platform compatibility
- Add WARNING prefix to plugin skip messages for better visibility

Documentation:
- Clarify two build activation mechanisms (env vars vs -Dskip.docker.build)
- Update all examples to show environment variable activation (no -Dskip.docker.build needed)
- Add prerequisite step to build from root before Docker build
- Add note about multi-arch --push behavior requiring authentication
- Use -pl :tika-grpc -am in examples to build only required modules
- Remove unnecessary 'clean' from example commands

All 18 review comments addressed. Multi-arch builds now use pinned digest for security. Documentation clearly explains activation precedence.
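The Docker installation check and the authentication error handling listed above might be sketched as follows. The helper names (`die`, `require_cmd`, `login_ecr`, `login_acr`) are hypothetical, and it is an assumption that the script authenticates through the `aws` and `az` CLIs; the commit message does not say which commands it uses.

```shell
# Sketch only: fail fast with a clear message instead of a cryptic later error.
# All helper names here are hypothetical, not taken from docker-build.sh.
die() {
  echo "ERROR: $*" >&2
  exit 1
}

require_cmd() {
  # `command -v` is the POSIX way to test whether a program is on PATH.
  command -v "$1" >/dev/null 2>&1 || die "$1 is not installed or not on PATH"
}

login_ecr() {
  # Assumes AWS_ACCOUNT_ID and AWS_REGION are set by the caller.
  local registry="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"
  aws ecr get-login-password --region "$AWS_REGION" \
    | docker login --username AWS --password-stdin "$registry" \
    || die "ECR authentication failed for $registry"
}

login_acr() {
  # Assumes AZURE_REGISTRY_NAME is set by the caller.
  az acr login --name "$AZURE_REGISTRY_NAME" \
    || die "ACR authentication failed for $AZURE_REGISTRY_NAME"
}
```

A script would call `require_cmd docker` near the top, then `login_ecr` or `login_acr` only when the matching variables are present, so a missing tool or a rejected login stops the build before any image is produced.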
@tballison I'm interested in what you think of this one.
The test job failed due to a flaky 500 error from the dependency server.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.
- Fix README example showing Maven property syntax instead of environment variable
- Add 'set -e' to docker-build.sh for automatic error handling
- Remove unnecessary '-r' flag from cp commands for files
- Add buildx builder existence check to avoid duplicate creation errors
- Add proper error handling for docker build commands with cleanup
- Add date comment for binfmt digest verification
- Pass VERSION build arg to docker build commands
- Move COPY commands after RUN in Dockerfile for better cache efficiency
- Add non-root user (tika) to run container for security
- Add TIKA_VERSION validation in start-tika-grpc.sh
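A TIKA_VERSION check of the kind mentioned in the last item could be as small as the sketch below. `validate_tika_version` is a hypothetical name, and the shape test is an assumption based on the `4.0.0-SNAPSHOT` value used elsewhere in this thread; the actual check in start-tika-grpc.sh may be stricter or looser.

```shell
# Sketch only: reject an empty or obviously malformed version before starting the server.
# validate_tika_version is a hypothetical helper, not part of start-tika-grpc.sh.
validate_tika_version() {
  # An unset or empty value is always an error.
  if [ -z "${1:-}" ]; then
    echo "ERROR: TIKA_VERSION is not set" >&2
    return 1
  fi
  # Loose shape check (assumption): starts with a digit and has at least two
  # dots each followed by a digit, e.g. 4.0.0 or 4.0.0-SNAPSHOT.
  case "$1" in
    [0-9]*.[0-9]*.[0-9]*) return 0 ;;
    *)
      echo "ERROR: unexpected TIKA_VERSION '$1'" >&2
      return 1
      ;;
  esac
}
```

An entrypoint could run `validate_tika_version "${TIKA_VERSION:-}" || exit 1` before launching Java, so a misconfigured container stops with a readable message.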
@tballison to keep things simple I will use
I have had Infra create a new Docker Hub repo. The goal is to create a tika-grpc service from https://github.com/apache/tika/tree/main/tika-grpc. To stay in alignment with what tika-server does, I then created https://github.com/apache/tika-grpc-docker
This PR is obsoleted by a PR coming in https://github.com/apache/tika-grpc-docker
…Store
- Created build-from-branch.sh script to build Docker images from Git branches
- Added Dockerfile.ignite for building with Ignite ConfigStore support
- Added sample Ignite configuration and documentation
- Updated main README with build-from-branch instructions

This enables testing development features (like TIKA-4583 Ignite ConfigStore) before they are officially released, without needing to modify the main Tika repository per PR #2462.

Usage: ./build-from-branch.sh -b TIKA-4583-ignite-config-store -i

Features:
- Builds from any Git branch or tag
- Optional Ignite ConfigStore plugin inclusion
- Supports custom repositories (forks)
- Automatic testing after build
- Optional push to registry

Related to: apache/tika#2462, TIKA-4583
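For readers unfamiliar with this style of script, the sketch below shows one way flags like the `-b` and `-i` in the usage line could be parsed with `getopts`. The meanings assigned here (`-b` for the branch or tag, `-i` for including Ignite) are inferred from the feature list and are assumptions, as are the `parse_args`, `BRANCH`, and `WITH_IGNITE` names; the real build-from-branch.sh may differ.

```shell
# Sketch only: parse a required -b <branch-or-tag> and an optional -i switch.
# parse_args, BRANCH, and WITH_IGNITE are hypothetical names.
parse_args() {
  BRANCH=""
  WITH_IGNITE=false
  OPTIND=1   # reset so the function can be called more than once
  while getopts "b:i" opt; do
    case "$opt" in
      b) BRANCH="$OPTARG" ;;
      i) WITH_IGNITE=true ;;
      *) echo "usage: $0 -b <branch-or-tag> [-i]" >&2; return 2 ;;
    esac
  done
  # The branch is mandatory in this sketch.
  if [ -z "$BRANCH" ]; then
    echo "ERROR: -b <branch-or-tag> is required" >&2
    return 2
  fi
}
```

After `parse_args -b TIKA-4583-ignite-config-store -i`, `BRANCH` holds the branch name and `WITH_IGNITE` is `true`; omitting `-i` leaves it `false`.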
📝 Documentation Update
I've added documentation to clarify the relationship between this Maven-based Docker build and the tika-grpc-docker repository:
What's New
tika-grpc/docker-build/README.md now includes a section explaining:
tika-grpc/README.md now includes:
Why This Matters
Since tika-grpc will first be released in Tika 4.0.0, it's important to clarify that:
This aligns with Apache best practices where official release images should be built from GPG-verified artifacts, not from source code.
Related PR
I've also created a complementary PR in tika-grpc-docker: apache/tika-grpc-docker#1
This ensures both repositories have consistent documentation and are prepared for the Tika 4.0.0 release.
https://issues.apache.org/jira/browse/TIKA-4578
Add the ability to build a Docker image, tag it, and deploy it to a repository using a Maven Docker plugin.