Caching of documents leaves a lot to be desired

When working on #458 (closed), I stumbled upon this entry in the context database:

[Screenshot: Screenshot_2025-09-15_at_11.44.15]

As you can see, the hashes are different but the title is the same. Looking into the payloads reveals the following:

{
    "bdd": "OpenAlex"
  , "doi": "https://doi.org/10.1145/1411286.1411290"
  , "page": 25
  , "title": "Haskell session types with (almost) no class"
  , "authors": "Riccardo Pucella, Jesse A. Tov"
  , "abstract": "We describe an implementation of session types in Haskell. Session types statically enforce that client-server communication proceeds according to protocols. They have been added to several concurrent calculi, but few implementations of session types are available."
  , "institutes": "Northeastern University, Northeastern University"
  , "language_iso2": "en"
  , "publication_day": 25
  , "publication_date": "2008-09-25T00:00:00"
  , "publication_year": 2008
  , "publication_month": 9
}

{
    "bdd": "OpenAlex"
  , "doi": "https://doi.org/10.1145/1543134.1411290"
  , "page": 25
  , "title": "Haskell session types with (almost) no class"
  , "source": "ACM SIGPLAN Notices"
  , "authors": "Riccardo Pucella, Jesse A. Tov"
  , "abstract": "We describe an implementation of session types in Haskell. Session types statically enforce that client-server communication proceeds according to protocols. They have been added to several concurrent calculi, but few implementations of session types are available. Our embedding takes advantage of Haskell where appropriate, but we rely on no exotic features. Thus our approach translates with minimal modification to other polymorphic, typed languages such as ML and Java. Our implementation works with existing Haskell concurrency mechanisms, handles multiple communication channels and recursive session types, and infers protocols automatically. While our implementation uses unsafe operations in Haskell, it does not violate Haskell's safety guarantees. We formalize this claim in a concurrent calculus with unsafe communication primitives over which we layer our implementation of session types, and we prove that the session types layer is safe. In particular, it enforces that channel-based communication follows consistent protocols."
  , "institutes": "Northeastern University, Northeastern University"
  , "language_iso2": "en"
  , "publication_day": 25
  , "publication_date": "2008-09-25T00:00:00"
  , "publication_year": 2008
  , "publication_month": 9
}

These are essentially the same paper. However, the DOIs differ slightly, the first abstract is a truncated version of the second, and the second payload carries an extra source field. As a result the final hashes differ, and GGTX treats the two entries as distinct documents.
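To make the failure mode concrete, here is a minimal Python sketch. How GGTX actually computes its document hash is not shown in this issue, so `full_hash` (SHA-256 over the serialized payload) is only an assumed stand-in. `identity_hash`, `normalize` and the choice of `IDENTITY_FIELDS` are hypothetical names illustrating one possible alternative, not existing code. The abstracts are shortened for brevity.

```python
import hashlib
import json
import unicodedata

# Hypothetical choice of fields that identify "the same paper".
IDENTITY_FIELDS = ("title", "authors", "publication_date")


def full_hash(doc: dict) -> str:
    """Hash the entire payload (assumed stand-in for the current behaviour)."""
    canonical = json.dumps(doc, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def normalize(text: str) -> str:
    """Case-fold, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text).casefold()
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_marks)
    return " ".join(cleaned.split())


def identity_hash(doc: dict) -> str:
    """Hash only the normalized identity fields, ignoring everything else."""
    parts = [normalize(str(doc.get(field, ""))) for field in IDENTITY_FIELDS]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


doc_a = {
    "doi": "https://doi.org/10.1145/1411286.1411290",
    "title": "Haskell session types with (almost) no class",
    "authors": "Riccardo Pucella, Jesse A. Tov",
    "abstract": "We describe an implementation of session types in Haskell.",
    "publication_date": "2008-09-25T00:00:00",
}
doc_b = {
    "doi": "https://doi.org/10.1145/1543134.1411290",
    "title": "Haskell session types with (almost) no class",
    "source": "ACM SIGPLAN Notices",
    "authors": "Riccardo Pucella, Jesse A. Tov",
    "abstract": "We describe an implementation of session types in Haskell. "
                "Our embedding takes advantage of Haskell where appropriate.",
    "publication_date": "2008-09-25T00:00:00",
}

print(full_hash(doc_a) == full_hash(doc_b))          # whole-payload hashes differ
print(identity_hash(doc_a) == identity_hash(doc_b))  # identity hashes agree
```

The trade-off is that a narrow identity key can also merge genuinely different documents that happen to share a title, author list and date, so the field selection would need to be validated against real data.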

This is obviously a tricky problem to solve, but I also wonder how we could curate the data to correctly capture the concept of "similarity" here.
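As one possible starting point for the "similarity" question, the sketch below scores two abstracts with the overlap coefficient (shared tokens divided by the size of the smaller token set), which stays high when one abstract is a truncated copy of the other. The function names and the `0.9` threshold are assumptions for illustration only and are not part of GGTX.

```python
import re


def tokens(text: str) -> set[str]:
    """Lower-cased word tokens of a text."""
    return set(re.findall(r"\w+", text.casefold()))


def overlap_coefficient(a: str, b: str) -> float:
    """|A ∩ B| / min(|A|, |B|); 1.0 when one token set contains the other."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def likely_same_paper(x: dict, y: dict, threshold: float = 0.9) -> bool:
    """Hypothetical rule: same title tokens and strongly overlapping abstracts."""
    same_title = tokens(x.get("title", "")) == tokens(y.get("title", ""))
    score = overlap_coefficient(x.get("abstract", ""), y.get("abstract", ""))
    return same_title and score >= threshold


short = {
    "title": "Haskell session types with (almost) no class",
    "abstract": "We describe an implementation of session types in Haskell. "
                "Session types statically enforce that client-server "
                "communication proceeds according to protocols.",
}
longer = {
    "title": "Haskell session types with (almost) no class",
    "abstract": short["abstract"]
                + " Our embedding takes advantage of Haskell where "
                  "appropriate, but we rely on no exotic features.",
}

print(overlap_coefficient(short["abstract"], longer["abstract"]))
print(likely_same_paper(short, longer))
```

A symmetric measure such as Jaccard similarity would penalize the truncated abstract for being shorter, which is why a containment-style score is used in this sketch.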