Use pdf checksums instead of file paths for indexing? #924
Comments
I mostly follow what you're saying but want to confirm. Are you talking about:
One way to approach the first bullet is to use relative paths via a setting. Actually, this is what we do internally at FutureHouse. This makes the indexed paths relative, which should be resilient to root directory renames.
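To illustrate why relative paths survive a root rename, here is a minimal sketch using only `pathlib`. It is not paper-qa's implementation: the `index_key` helper and its `use_absolute` flag are hypothetical names made up for this example, and the directory names are placeholders.

```python
from pathlib import Path


def index_key(file_path: Path, root: Path, use_absolute: bool = False) -> str:
    """Build a hypothetical index key for a file stored under `root`.

    This helper is an illustration only, not part of paper-qa.
    """
    if use_absolute:
        # Absolute keys embed the root directory name, so renaming it changes them.
        return file_path.resolve().as_posix()
    # Relative keys drop the root prefix, so renaming the root leaves them unchanged.
    return file_path.relative_to(root).as_posix()


before = index_key(Path("papers/sub/a.pdf"), Path("papers"))
after = index_key(Path("papers2/sub/a.pdf"), Path("papers2"))
print(before == after)
```

Note that a relative key still changes if a file is moved between subdirectories inside the root, so this only covers renames of the root itself.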
I'm thinking of the key in the index. I noticed the Line 405 in 29135d0
False, which is what I want?
I've only just started experimenting with paper-qa, so it's also completely possible that I am just doing something untoward in the settings. Steps to recreate (on my end..):
test.py
When I run the script the first time, the paper is indexed and queried as expected. Running the script a second time without changing the directory name skips the indexing and goes straight to the query, as expected. Once I rename the "test1" directory to something else, however, paper-qa treats it as new again:
Version: 17fb0a3 (Mar 29, 2025)
At present, paper-qa seems to create indices indexed by file path.
For example, renaming the root document directory from "papers" to "papers2" causes paper-qa to treat all files inside as "new".
Would it be possible to avoid using file paths and instead use something like md5 hashes of the PDFs?
This way the input papers can be moved / reorganized without having to recompute the indices.
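As a rough sketch of the idea being requested, the snippet below keys files by an MD5 digest of their bytes instead of by path. It uses only the standard library and does not reflect how paper-qa stores its index: `file_checksum`, `find_unindexed`, and the `indexed` mapping are hypothetical names introduced for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        # Read fixed-size chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_unindexed(root: Path, indexed: dict[str, str]) -> list[Path]:
    """List PDFs under `root` whose checksum is not yet a key in `indexed`.

    `indexed` is a hypothetical mapping of checksum -> any stored metadata.
    """
    return [p for p in sorted(root.rglob("*.pdf")) if file_checksum(p) not in indexed]
```

Because the key depends only on file contents, renaming or reorganizing directories would not make a file look new under this scheme. The trade-off is that every file has to be read to compute its digest, and two byte-identical PDFs would share one key.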