
fix: FileTypeRouter no longer silently drops "+"-containing MIME types#11648

Merged
anakin87 merged 4 commits into
deepset-ai:mainfrom
Aarkin7:fix/file-type-router-literal-mime-matching
Jun 24, 2026
@Aarkin7 Aarkin7 commented Jun 15, 2026

Contributor

Related Issues

Proposed Changes:

FileTypeRouter.__init__ compiled every entry in mime_types as a regex. For any IANA type containing + (image/svg+xml, application/ld+json, and every application/*+xml and application/*+json type), the + was read as a regex quantifier, so matching sources silently fell into the unclassified bucket with no error and no warning.
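The failure mode can be reproduced with the standard re module alone. This is a minimal sketch that does not use FileTypeRouter itself; the variable names are illustrative only.

```python
import re

# A literal IANA MIME type, compiled as a regex (as the description says __init__ did).
pattern = re.compile("image/svg+xml")

# "g+" means "one or more g", so the literal "+" in the input can never match.
literal_match = pattern.fullmatch("image/svg+xml")

# The same pattern does accept a string that is not the intended MIME type.
accidental_match = pattern.fullmatch("image/svggxml")

print(literal_match is None)         # True
print(accidental_match is not None)  # True
```

Because fullmatch returns None for the literal string, a router that relies only on the compiled pattern has nowhere to put the source except the fallback bucket.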

Fix: in run, check each source's MIME type against mime_types by exact equality first, then fall back to the existing regex fullmatch. Literal MIMEs now route correctly without re.escape or any init-time classification; explicit regex patterns like audio/.* keep working unchanged.
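As a rough standalone illustration of that matching order, the sketch below routes a mapping of source names to MIME types. The function route_by_mime and its arguments are hypothetical and are not part of Haystack; the real component resolves MIME types from the sources itself.

```python
import re
from collections import defaultdict


def route_by_mime(mime_by_source, mime_types):
    """Group sources under the first mime_types entry that matches them."""
    # Keep each raw entry next to its compiled form so buckets stay keyed
    # by the user's original string, not by a rewritten pattern.
    patterns = [(raw, re.compile(raw)) for raw in mime_types]
    buckets = defaultdict(list)
    for source, mime_type in mime_by_source.items():
        for raw, pattern in patterns:
            # Exact equality first; regex fullmatch only as the fallback.
            if mime_type is not None and (mime_type == raw or pattern.fullmatch(mime_type)):
                buckets[raw].append(source)
                break
        else:
            # No entry matched, or the MIME type could not be determined.
            buckets["unclassified"].append(source)
    return dict(buckets)


routed = route_by_mime(
    {"logo.svg": "image/svg+xml", "talk.mp3": "audio/mpeg", "notes.bin": None},
    ["image/svg+xml", r"audio/.*"],
)
print(routed)
```

The literal entry is found by equality, the audio/.* entry still matches through the regex, and the source without a MIME type lands in unclassified.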

How did you test it?

Added 4 new unit tests in test/components/routers/test_file_router.py:

  • test_literal_mime_with_regex_metacharacters_matches_self (parametrized across 5 MIMEs containing + or dotted IANA segments, including OOXML)
  • test_to_dict_from_dict_preserves_literal_and_regex_mix
  • test_pipeline_output_socket_name_matches_literal_mime_with_plus (end-to-end Pipeline, guards that the output socket name preserves the user's original +-bearing string)
  • test_additional_mimetypes_with_literal_plus

Also updated the pre-existing test_invalid_regex_pattern to match the widened error message.

Wider verification:

  • test/components/routers/ — 30/30 pass
  • test/components/routers/ + test/core/pipeline/ — 491/491 pass
  • hatch run fmt-check clean
  • hatch run test:types haystack/components/routers/file_type_router.py clean
  • Manual repro of the original image/svg+xml failure now routes correctly

Notes for the reviewer

  • Bucket key stays the user's original mime_types entry, so output socket names, pipe.connect("router.image/svg+xml", ...), and to_dict / from_dict round-trips are unaffected. The pipeline-socket test guards this explicitly.
  • The error message for invalid input was widened from "Invalid regex pattern" to "Invalid MIME type or regex pattern" to reflect that the parameter formally accepts both. The existing test was updated accordingly.
  • No public API change, no serialization change.
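For the widened error message, a minimal sketch of init-time validation might look like the following. The helper compile_mime_patterns is hypothetical, and only the "Invalid MIME type or regex pattern" prefix comes from the notes above; the rest of the message format is an assumption.

```python
import re


def compile_mime_patterns(mime_types):
    """Compile each entry, raising a ValueError that names the offending one."""
    compiled = []
    for raw in mime_types:
        try:
            # Store the raw string with its pattern so it can serve as the bucket key.
            compiled.append((raw, re.compile(raw)))
        except re.error as exc:
            # Message suffix is assumed; the prefix matches the PR notes.
            raise ValueError(f"Invalid MIME type or regex pattern: '{raw}'.") from exc
    return compiled


try:
    compile_mime_patterns(["image/svg+xml", "[unclosed"])
except ValueError as err:
    message = str(err)
    print(message)
```

Note that image/svg+xml still compiles without error, which is why the original bug produced no exception: the string is a valid regex, just not one that matches itself.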

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • I have added a release note file, following the contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

@Aarkin7 Aarkin7 requested a review from a team as a code owner June 15, 2026 18:56
@Aarkin7 Aarkin7 requested review from anakin87 and removed request for a team June 15, 2026 18:56
@github-actions github-actions Bot added the topic:tests and type:documentation (Improvements on the docs) labels Jun 15, 2026
@anakin87

Member

While the bug is real, I have a few comments.

> The . in literals like application/pdf acted as a wildcard, so unrelated strings such as applicationXpdf matched the wrong bucket.

This example is wrong. "application/vnd.ms-excel" might be a better one.
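A quick check with the standard re module illustrates the difference between the two examples. This is only a sketch; the look-alike strings are made up for the demonstration.

```python
import re

# "application/vnd.ms-excel" contains a ".", which a regex reads as "any character".
pattern = re.compile("application/vnd.ms-excel")
intended = pattern.fullmatch("application/vnd.ms-excel")
unintended = pattern.fullmatch("application/vndXms-excel")

# "application/pdf" contains no "." at all, so "applicationXpdf" cannot match it.
pdf_lookalike = re.compile("application/pdf").fullmatch("applicationXpdf")

print(intended is not None)    # True
print(unintended is not None)  # True
print(pdf_lookalike is None)   # True
```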


I'd suggest implementing a simpler solution like this

  # In run(): try an exact literal match first, use the regex only as a fallback
  matched = False
  if mime_type:
      for raw, pattern in self.mime_type_patterns:
          if mime_type == raw or pattern.fullmatch(mime_type):
              mime_types[raw].append(source)
              matched = True
              break
  if not matched:
      mime_types["unclassified"].append(source)

WDYT? Does this work for you?


Another general note: it's good to add unit tests, but let's try to keep good coverage without adding many near-duplicate tests.

@anakin87 anakin87 left a comment

Member

See comments above

@Aarkin7 Aarkin7 changed the title from 'fix: FileTypeRouter matches literal MIME types exactly so "+" and "." stop being treated as regex' to 'fix: FileTypeRouter no longer silently drops "+"-containing MIME types' Jun 24, 2026
@Aarkin7 Aarkin7 requested a review from anakin87 June 24, 2026 13:55
@Aarkin7

Aarkin7 commented Jun 24, 2026

Contributor Author

Hi @anakin87
Pushed the simpler approach you suggested and trimmed the tests down to 4. Let me know what you think!

@anakin87 anakin87 left a comment

Member

Looks good!

@anakin87 anakin87 merged commit 6e1149b into deepset-ai:main Jun 24, 2026
22 of 23 checks passed

Labels

topic:tests type:documentation Improvements on the docs

Development

Successfully merging this pull request may close these issues.

FileTypeRouter silently drops MIME types containing "+" (e.g. image/svg+xml) into "unclassified"

2 participants