Use pdf checksums instead of file paths for indexing? #924

Open
khughitt opened this issue Mar 29, 2025 · 2 comments
Labels
enhancement New feature or request


@khughitt

At present, paper-qa seems to create indices indexed by file path.

For example, renaming the root document directory for papers to papers2 causes paper-qa to treat all files inside as "new".

Would it be possible to avoid using file paths and instead use something like md5 hashes of the PDFs?

This way the input papers can be moved / reorganized without having to recompute the indices.
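To illustrate the idea, here is a minimal sketch of a content-based key, assuming the index could be keyed by a digest of the PDF bytes. The helper names `file_checksum` and `demo_rename_is_stable` are hypothetical and are not part of paper-qa; MD5 is used only because it is the hash suggested above, and any stable digest would work the same way.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file's bytes (hypothetical helper)."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        # Read in chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo_rename_is_stable() -> bool:
    """Check that renaming the parent directory leaves the checksum unchanged."""
    with tempfile.TemporaryDirectory() as tmp:
        old_dir = Path(tmp) / "papers"
        old_dir.mkdir()
        # Stand-in bytes; a real PDF would behave the same way.
        (old_dir / "xx.pdf").write_bytes(b"%PDF-1.4 stand-in content")
        before = file_checksum(old_dir / "xx.pdf")

        new_dir = Path(tmp) / "papers2"
        shutil.move(str(old_dir), str(new_dir))
        after = file_checksum(new_dir / "xx.pdf")
        return before == after
```

Because the digest depends only on the file contents, a key built this way would survive both directory renames and file moves, at the cost of reading every file once to compute it.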

@dosubot dosubot bot added the enhancement New feature or request label Mar 29, 2025
@jamesbraza
Collaborator

I mostly follow what you're saying but want to confirm. Are you talking about:

One way to approach the first bullet is to use relative paths by setting settings.agent.index.use_absolute_paper_directory to False: https://github.com/Future-House/paper-qa/blob/v5.20.0/paperqa/agents/search.py#L509-L510

Actually this is what we do internally at FutureHouse.

This makes the indexed paths relative to the paper directory, which should be resilient to root directory renames.
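As a rough sketch of why relative paths would survive a root rename: the function `index_key` below is hypothetical and is not paper-qa's actual implementation. It only assumes that the key is either the full path or the path relative to the paper directory.

```python
from pathlib import PurePosixPath


def index_key(file_path: str, root: str, use_absolute: bool) -> str:
    """Build an illustrative index key (hypothetical, not paper-qa's code)."""
    path = PurePosixPath(file_path)
    if use_absolute:
        # The key embeds the root directory, so renaming the root changes it.
        return str(path)
    # The key is relative to the root, so the root's own name drops out.
    return str(path.relative_to(root))


# Same file before and after renaming the root from test1 to test2.
abs_before = index_key("/data/test1/xx.pdf", "/data/test1", use_absolute=True)
abs_after = index_key("/data/test2/xx.pdf", "/data/test2", use_absolute=True)
rel_before = index_key("/data/test1/xx.pdf", "/data/test1", use_absolute=False)
rel_after = index_key("/data/test2/xx.pdf", "/data/test2", use_absolute=False)
```

Under these assumptions the absolute keys differ while the relative keys match. Note that a relative key would still change if a file were moved between subdirectories.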

@khughitt
Author

I'm thinking of the key in the index.

I noticed the settings.agent.index.use_absolute_paper_directory option, but from the README and code (use_absolute_paper_directory: bool = Field(), it looks like it defaults to False, which is what I want?

I've only just started experimenting with paper-qa, so it's also completely possible that I am just doing something untoward in the settings.

Steps to reproduce (on my end):

  1. create a directory ("test1") with a single paper
  2. python test.py
  3. mv test1 test2
  4. python test.py

test.py

from paperqa import Settings, ask

local_llm_config = {
    "model_list": [
        {
            "model_name": "ollama/deepseek-r1:1.5b",
            "litellm_params": {
                "model": "ollama/deepseek-r1:1.5b",
                "api_base": "http://localhost:11434"
            }
        }
    ]
}

answer = ask(
    "What are the most important approaches used in drug discovery?",
    settings=Settings(
        paper_directory="/path/to/test1",
        llm="ollama/deepseek-r1:1.5b",
        embedding="ollama/nomic-embed-text",
        llm_config=local_llm_config,
        summary_llm="ollama/deepseek-r1:1.5b",
        summary_llm_config=local_llm_config
    )
)

When I run the script the first time, the paper is indexed and queried as expected.

Running the script a second time without changing the directory name skips the indexing and goes straight to the query, as expected.

Once I rename the "test1" directory to something else, however, paper-qa treats it as new again:

[13:25:05] New file to index: xx.pdf...

Version: 17fb0a3 (Mar 29, 2025)
