Use pdf checksums instead of file paths for indexing? #924

khughitt · 2025-03-29T19:48:55Z

At present, paper-qa seems to key its index entries by file path.

For example, renaming the root document directory for papers to papers2 causes paper-qa to treat all files inside as "new".

Would it be possible to avoid using file paths and instead use something like md5 hashes of the PDFs?

This way the input papers can be moved / reorganized without having to recompute the indices.
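To illustrate the idea, here is a minimal sketch of content-based keys using only the Python standard library. It is not paper-qa's implementation: the helper names `file_checksum` and `find_unindexed` are hypothetical, and the set of already-indexed hashes stands in for whatever the real index stores.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file's bytes, read in chunks."""
    # MD5 is used here only as a content fingerprint, not for security.
    digest = hashlib.md5(usedforsecurity=False)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_unindexed(paper_dir: Path, indexed: set[str]) -> list[Path]:
    """List PDFs under paper_dir whose content hash is not in `indexed`.

    `indexed` is a hypothetical stand-in for the keys already in an index.
    """
    return [
        pdf
        for pdf in sorted(paper_dir.rglob("*.pdf"))
        if file_checksum(pdf) not in indexed
    ]
```

Because the key depends only on the file's bytes, renaming or moving the directory leaves it unchanged. The trade-off is that every file has to be read in full to decide whether it is new.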


jamesbraza · 2025-03-29T23:27:42Z

I mostly follow what you're saying but want to confirm. Are you talking about:

- The key in the index: https://github.com/Future-House/paper-qa/blob/v5.20.0/paperqa/agents/search.py#L313
- The autogenerated index name: https://github.com/Future-House/paper-qa/blob/v5.20.0/paperqa/settings.py#L774-L779

One way to approach the first bullet is to use relative paths by setting settings.agent.index.use_absolute_paper_directory to False: https://github.com/Future-House/paper-qa/blob/v5.20.0/paperqa/agents/search.py#L509-L510

Actually this is what we do internally at FutureHouse.

This makes the index keys relative paths, which should be resilient to root directory renames.
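As a rough sketch of why relative keys survive a rename of the root directory (this is an illustration only, not the code at the links above, and `index_key` is a hypothetical helper):

```python
from pathlib import PurePosixPath


def index_key(file_path: str, paper_directory: str, use_absolute: bool = False) -> str:
    """Build a hypothetical index key for a paper.

    With use_absolute=False the key is the path relative to the paper
    directory, so it does not change when that directory is renamed.
    """
    path = PurePosixPath(file_path)
    if use_absolute:
        return str(path)
    return str(path.relative_to(PurePosixPath(paper_directory)))


# The same file under a renamed root yields the same relative key.
key_before = index_key("/data/test1/xx.pdf", "/data/test1")
key_after = index_key("/data/test2/xx.pdf", "/data/test2")
```

Relative keys still change if a file is moved or renamed inside the paper directory, which is the case a content checksum would also cover.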

khughitt · 2025-03-30T17:42:03Z

I'm thinking of the key in the index.

I noticed the settings.agent.index.use_absolute_paper_directory option, but from the README and code, it looks like it defaults to False, which is what I want? For reference, paper-qa/paperqa/settings.py, line 405 in 29135d0:

use_absolute_paper_directory: bool = Field(

I've only just started experimenting with paper-qa, so it's also completely possible that I am just doing something untoward in the settings.

Steps to reproduce (on my end):

create a directory ("test1") with a single paper
python test.py
mv test1 test2
python test.py

test.py

from paperqa import Settings, ask

local_llm_config = {
    "model_list": [
        {
            "model_name": "ollama/deepseek-r1:1.5b",
            "litellm_params": {
                "model": "ollama/deepseek-r1:1.5b",
                "api_base": "http://localhost:11434"
            }
        }
    ]
}

answer = ask(
    "What are the most important approaches used in drug discovery?",
    settings=Settings(
        paper_directory="/path/to/test1",
        llm="ollama/deepseek-r1:1.5b",
        embedding="ollama/nomic-embed-text",
        llm_config=local_llm_config,
        summary_llm="ollama/deepseek-r1:1.5b",
        summary_llm_config=local_llm_config
    )
)

When I run the script the first time, the paper is indexed and queried as expected.

Running the script a second time without changing the directory name skips the indexing and goes straight to the query, as expected.

Once I rename the "test1" directory to something else, however, paper-qa treats it as new again:

[13:25:05] New file to index: xx.pdf...

Version: 17fb0a3 (Mar 29, 2025)

dosubot bot added the enhancement New feature or request label Mar 29, 2025
