Skip to content

fix: replace urlretrieve in getRemoteFile#444

Draft
afuetterer wants to merge 6 commits intochrismattmann:masterfrom
afuetterer:442-tests
Draft

fix: replace urlretrieve in getRemoteFile#444
afuetterer wants to merge 6 commits intochrismattmann:masterfrom
afuetterer:442-tests

Conversation

@afuetterer
Copy link
Contributor

@afuetterer afuetterer commented Feb 24, 2026

Failing unit tests

  • see test: remove failing unit test #447
  • FAILED tika/tests/test_tika.py::CreateTest::test_remote_jpg - urllib.error.HTTPError: HTTP Error 403: Forbidden
  • FAILED tika/tests/test_tika.py::CreateTest::test_remote_pdf - urllib.error.HTTPError: HTTP Error 502: Bad Gateway

Tasks

  • fix broken url to pdf file
  • decide how to deal with urlretrieve
  • rebase
  • remove unit test 1 commit
  • remove deps commit
  • squash all commits

As discussed in chrismattmann#442 this test is broken and the logic it is testing is outside of tika-python anyway.
@afuetterer

This comment was marked as outdated.

@afuetterer afuetterer mentioned this pull request Feb 24, 2026
10 tasks
@chrismattmann
Copy link
Owner

As per your comment in #442:

  1. Let's remove the test_ssl_link.py test. Not sure what it's testing either and it has become outdated regardless.
  2. Header issue. Please update.
  3. Any old PDF file should do....
  1. The header issue is not in the test, but in the library:
    urlretrieve(urlOrPath, destPath)
  2. Same here I think. The broken pdf url is just step 1, I think if there is a valid url, it will still return 403

Got it, so we have to fix urlRetrieve then? Perhaps we should add a unit test for it too.

@afuetterer
Copy link
Contributor Author

As I said above, replacing the url to a remote pdf, does help a bit.
Now we got from
FAILED tika/tests/test_tika.py::CreateTest::test_remote_pdf - urllib.error.HTTPError: HTTP Error 502: Bad Gateway
to
FAILED tika/tests/test_tika.py::CreateTest::test_remote_pdf - urllib.error.HTTPError: HTTP Error 403: Forbidden

The error is in getRemoteFile.

It is a common issue: https://www.google.com/search?q=python+urlretrieve+403+forbidden

Maybe use custom headers here or switch to using requests.

@afuetterer afuetterer marked this pull request as draft February 24, 2026 19:42
@afuetterer

This comment was marked as outdated.

@afuetterer

This comment was marked as resolved.

@afuetterer afuetterer changed the title Fix failing unit tests so CI can pass fix: replace urlretrieve in getRemoteFile Feb 26, 2026
@afuetterer afuetterer changed the title fix: replace urlretrieve in getRemoteFile fix: replace urlretrieve in getRemoteFile Feb 26, 2026
@afuetterer
Copy link
Contributor Author

I suggest splitting the issues discovered during the work on this PR into smaller problems, that can be reviewed individually and merged faster.

The "main issue" discovered in this PR is the broken getRemoteFile. The problem is in tika-python's library, not the unit tests. Agreed?

How to proceed from here?

Write and use a new helper function? Something like _urlretrieve() It should use requests internally and send a user-agent of "tika-python"? Streaming response, due to potentially large files?

What do you think?

@chrismattmann
Copy link
Owner

I suggest splitting the issues discovered during the work on this PR into smaller problems, that can be reviewed individually and merged faster.

The "main issue" discovered in this PR is the broken getRemoteFile. The problem is in tika-python's library, not the unit tests. Agreed?

How to proceed from here?

Write and use a new helper function? Something like _urlretrieve() It should use requests internally and send a user-agent of "tika-python"? Streaming response, due to potentially large files?

What do you think?

Yes, perfect let's write a new _urlRetrieve that:

  1. uses requests internally
  2. uses tika-python as the User-Agent
  3. allows for streaming, etc.

@afuetterer
Copy link
Contributor Author

Please have a look at #446 and #447 in the meantime. These changes have been extracted from this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants