Proto commits in MarginaliaSearch/MarginaliaSearch

These 62 commits changed the Protocol Buffers files:

Commit:fc13884
Author:Viktor Lofgren
Committer:Viktor Lofgren

(actor) Add the ability to filter sample data based on content type
This will help in extracting relevant test sets for PDF processing.

The documentation is generated from this commit.

Commit:cfd4712
Author:Viktor Lofgren

(favicon) Add capability for fetching favicons

Commit:a84a069
Author:Viktor Lofgren

(ranking-params) Add disable penalties flag to ranking params
This will help debugging ranking issues. Later it may be added to some filters.

Commit:9be477d
Author:Viktor Lofgren

(domain-info) Add a feed flag to domain info
This is a bit of a sketchy solution that requires both assistant services to run on the same physical machine.

Commit:94d4d2e
Author:Viktor Lofgren

(live-crawler) Add refresh date to feeds API
For now this is just the ctime for the feeds db. We may want to store this per-record in the future.

Commit:d4bce13
Author:Viktor Lofgren

(export) Add export actors to precession
Adding a tracking message to the export actor means it's possible to run them in a precession. Adding a new precession actor, and some GUI components for triggering exports. The change also adds a heartbeat to the export process.

Commit:47dfbac
Author:Viktor Lofgren

(conf) Introduce a new concept of node profiles
Node profiles decide which actors are started, and which views are available in the control GUI. This helps keep the system organized, and hides real-time clutter from the batch-oriented nodes.

Commit:c728a1e
Author:Viktor Lofgren

(rss) Add endpoint for extracting URLs changed within a timespan.

Commit:d874d76
Author:Viktor Lofgren

(rss) Add an endpoint that can be used for identifying when RSS data has changed

Commit:a2bc9a9
Author:Viktor Lofgren
Committer:Viktor Lofgren

(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished

Commit:e24a983
Author:Viktor Lofgren

(feed) Update API to allow specifying clean vs refresh update
Move the logic deciding which operation to perform into the actor, updating its state graph to incorporate a counter that runs a clean update once in a blue moon.
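
The counter described in this message can be pictured roughly as follows. This is only a minimal sketch: the class name, the enum and the interval are assumptions made for illustration, and the real logic lives in the actor's state graph rather than in a standalone class.

```java
// Hypothetical sketch: FeedUpdateSchedulerSketch, UpdateMode and cleanInterval
// are illustrative names, not identifiers taken from the project.
final class FeedUpdateSchedulerSketch {
    enum UpdateMode { CLEAN, REFRESH }

    private final int cleanInterval;   // run a clean update every N-th time
    private int runsSinceClean = 0;

    FeedUpdateSchedulerSketch(int cleanInterval) {
        this.cleanInterval = cleanInterval;
    }

    /** Decides which kind of update the next run should perform. */
    UpdateMode nextMode() {
        runsSinceClean++;
        if (runsSinceClean >= cleanInterval) {
            runsSinceClean = 0;        // reset the counter after a clean update
            return UpdateMode.CLEAN;
        }
        return UpdateMode.REFRESH;     // the common case: incremental refresh
    }
}
```

With an interval of 3, successive calls to `nextMode()` would yield two refresh updates followed by one clean update, and then the cycle repeats.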

Commit:bfeb9a4
Author:Viktor Lofgren
Committer:Viktor Lofgren

(feeds) Retire feedlot the feed bot, move RSS capture into the live-capture service

Commit:d84a2c1
Author:Viktor Lofgren

(*) Remove the crawl spec abstraction
The crawl spec abstraction was used to upload lists of domains into the system for future crawling. This was fairly clunky, and it was difficult to understand what was going to be crawled.
Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table. This is much preferred and means the operator can directly manage domains without specs.
This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.

Commit:23cce0c
Author:Viktor Lofgren
Committer:Viktor Lofgren

Add a new function 'Live Capture' for on-demand screenshot capture
The screenshots are requested by the site-service, and triggered via the site-info view.

Commit:73f973c
Author:Viktor Lofgren
Committer:Viktor Lofgren

(search-query) Add pagination to search query API and the direct query-service interface

Commit:9aa8f13
Author:Viktor Lofgren

(index) Remove tcfAvgDist ranking parameter
This is captured by tcfProximity already

Commit:0999f07
Author:Viktor Lofgren

(search-query) Add new ranking parameters for proximity and verbatim matches

Commit:03d5dec
Author:Viktor Lofgren

(*) Refactor termCoherences and rename them to phrase constraints.

Commit:2e89b55
Author:Viktor Lofgren

(wip) Repair qdebug utility and show new ranking details

Commit:aebb265
Author:Viktor Lofgren

(wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information. Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions.
This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.

Commit:dfd19b5
Author:Viktor Lofgren
Committer:Viktor Lofgren

(index) Reduce the number of abstractions around result ranking
The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous order was driving complexity in the code that generates these objects, and the consumer side of things puts all this data in the same object regardless.

Commit:8ed5b51
Author:Viktor
Committer:GitHub

Merge branch 'master' into term-positions

Commit:ad38579
Author:Viktor Lofgren
Committer:Viktor Lofgren

(search-api, ranking) Update with new ranking parameters
Adding new ranking parameters to the API and routing them through the system, in order to permit integration of the new position data with the ranking algorithm. The change also cleans out several parameters that no longer filled any function.

Commit:d86926b
Author:Viktor Lofgren
Committer:Viktor Lofgren

(crawl) Add new functionality for re-crawling a single domain

Commit:9d00243
Author:Viktor Lofgren

(index) Partial re-implementation of position constraints

Commit:eb74d08
Author:Viktor Lofgren
Committer:Viktor Lofgren

(qs) Additional info in query debug UI

Commit:155be10
Author:Viktor Lofgren
Committer:Viktor Lofgren

(index) Fix priority search terms
This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
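
As a rough illustration of what "non-mandatory terms that boost the ranking" could mean, here is a minimal sketch. The class name, the parameters, the additive boost and the convention that a higher score is better are all assumptions for the example, not the index service's actual ranking code.

```java
import java.util.OptionalDouble;
import java.util.Set;

// Hypothetical sketch: names, weights and scoring direction are assumptions.
final class PriorityTermSketch {
    static OptionalDouble score(Set<String> docTerms, Set<String> mandatory,
                                Set<String> priority, double baseScore, double boostPerTerm) {
        // A document missing any mandatory term is not a result at all.
        if (!docTerms.containsAll(mandatory)) {
            return OptionalDouble.empty();
        }
        // Priority terms never exclude a document; each one present only adds a boost.
        long matched = priority.stream().filter(docTerms::contains).count();
        return OptionalDouble.of(baseScore + boostPerTerm * matched);
    }
}
```

The point of the sketch is the asymmetry: a missing mandatory term removes the document, while a missing priority term merely leaves the base score unchanged.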

Commit:462aa9a
Author:Viktor Lofgren
Committer:Viktor Lofgren

(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
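
The "defaults are not sent" idea might look something like the sketch below. The record, its two fields and the default values are hypothetical stand-ins; the actual protobuf messages and their field names are not shown in this listing.

```java
import java.util.Optional;

// Hypothetical sketch: field names and default values are illustrative assumptions.
record RankingParamsSketch(double bm25NgramWeight, double tcfJaccardWeight) {
    static final RankingParamsSketch DEFAULTS = new RankingParamsSketch(1.0, 0.5);

    /** Sender side: omit the message entirely when every value equals its default. */
    Optional<RankingParamsSketch> toWire() {
        return equals(DEFAULTS) ? Optional.empty() : Optional.of(this);
    }

    /** Receiver side: an absent message means "use the defaults", so nothing is decoded. */
    static RankingParamsSketch fromWire(Optional<RankingParamsSketch> wire) {
        return wire.orElse(DEFAULTS);
    }
}
```

In this sketch an empty `Optional` plays the role of an unset message field, so the common case of default parameters costs nothing to transmit or decode.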

Commit:6efc0f2
Author:Viktor Lofgren
Committer:Viktor Lofgren

(index) Clean up data model
The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality.
The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.

Commit:b80a833
Author:Viktor Lofgren
Committer:Viktor Lofgren

(qs) Additional info in query debug UI

Commit:a3a6d62
Author:Viktor Lofgren
Committer:Viktor Lofgren

(qs, index) New query model integrated with index service.
Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.

Commit:4fb86ac
Author:Viktor Lofgren
Committer:Viktor Lofgren

(search) Fix outdated assumptions about the results
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.
For the API service, we'll simulate the old behavior to keep the API stable. For the search service, we'll introduce a new way of calculating positions through tree aggregation.

Commit:212d101
Author:Viktor Lofgren
Committer:Viktor Lofgren

(control) GUI for exporting segmentation data from a wikipedia zim

Commit:9b06433
Author:Viktor Lofgren

(qs) Additional info in query debug UI

Commit:def607d
Author:Viktor Lofgren

(qs) Additional info in query debug UI

Commit:7641a02
Author:Viktor Lofgren

(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.

Commit:599e719
Author:Viktor Lofgren

(index) Fix priority search terms
This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.

Commit:b6d365b
Author:Viktor Lofgren

(index) Clean up data model
The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality.
The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.

Commit:fcdc843
Author:Viktor Lofgren
Committer:Viktor Lofgren

(search) Fix outdated assumptions about the results
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.
For the API service, we'll simulate the old behavior to keep the API stable. For the search service, we'll introduce a new way of calculating positions through tree aggregation.

Commit:81815f3
Author:Viktor Lofgren

(qs, index) New query model integrated with index service.
Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.

Commit:afc047c
Author:Viktor Lofgren

(control) GUI for exporting segmentation data from a wikipedia zim

Commit:4642361
Author:Viktor Lofgren

(refac) Merge service-discovery and service modules
Also adds a few tests to the server/client code.

Commit:9f16496
Author:Viktor Lofgren
Committer:Viktor Lofgren

Clean up documentation and rename `domain-links` to `link-graph`

Commit:427f3e9
Author:Viktor Lofgren
Committer:Viktor Lofgren

(index) Retire count operation, clean up index code.

Commit:56d35aa
Author:Viktor Lofgren

(refac) Move execution API out of executor service

Commit:3fd2a83
Author:Viktor Lofgren

* Extract the search-query function

Commit:66c1281
Author:Viktor Lofgren

(zk-registry) epic yak shaving WIP
Cleaning out a lot of old junk from the code, and one thing led to another...
* Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds.
* The ProcessService's spawning is smarter. Will now just spawn a java process instead of relying on the application plugin's generated outputs.
* Project is migrated to GraalVM
* gRPC clients are re-written with a neat fluent/functional style. e.g.
```
channelPool.call(grpcStub::method)
           .async(executor) // <-- optional
           .run(argument);
```
This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall.
* For now the project is all in on zookeeper
* Service discovery is now based on APIs and not services. Theoretically means we could ship the same code as either a monolith or a service mesh.
* To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service.
WIP! Missing is documentation and testing, and some more breaking apart of code.
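
To make the fluent call style in this message more concrete, here is a minimal, self-contained sketch of how such a builder could be shaped. It does not use gRPC at all: `FluentCallSketch`, `Call` and `AsyncCall` are hypothetical names, a plain `Function` stands in for the stub method, and the channel error handling that motivated the real change is left out.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;

// Hypothetical sketch: a plain Function stands in for a gRPC stub method,
// and none of these class names come from the project.
final class FluentCallSketch {
    /** Entry point, analogous in shape to channelPool.call(grpcStub::method). */
    static <A, R> Call<A, R> call(Function<A, R> method) {
        return new Call<>(method);
    }

    static final class Call<A, R> {
        private final Function<A, R> method;

        Call(Function<A, R> method) { this.method = method; }

        /** Synchronous execution: invoke the method on the calling thread. */
        R run(A argument) { return method.apply(argument); }

        /** Optional step: switch to asynchronous execution on the given executor. */
        AsyncCall<A, R> async(Executor executor) { return new AsyncCall<>(method, executor); }
    }

    static final class AsyncCall<A, R> {
        private final Function<A, R> method;
        private final Executor executor;

        AsyncCall(Function<A, R> method, Executor executor) {
            this.method = method;
            this.executor = executor;
        }

        /** Asynchronous execution: the result is delivered through a future. */
        CompletableFuture<R> run(A argument) {
            return CompletableFuture.supplyAsync(() -> method.apply(argument), executor);
        }
    }
}
```

Because `async(...)` returns a different type than `call(...)`, the compiler decides whether `run(...)` yields a plain value or a `CompletableFuture`, which is one plausible way to make the asynchronous step optional.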

Commit:0307c55
Author:Viktor Lofgren

(refac) Zookeeper for service-discovery, kill service-client lib (WIP)
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.
A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.
The last remaining REST service, the assistant-service, has been migrated to gRPC.
This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels. Since it's no longer used by anything, RxJava has been removed as a dependency from the project.
Although the current state seems reasonably stable, this is a work-in-progress commit.

Commit:66b3e71
Author:Viktor Lofgren

(search) Expose more search options
This change set updates the query APIs to enable the search service to add additional criteria, such as QueryStrategy and TemporalBias. The QueryStrategy makes it possible to e.g. require a match is in the title of a result, and TemporalBias enables penalizing results that are not within a particular time period.
These options are added to the search interface. The old 'recent results' is modified to use TemporalBias, and a new filter 'Search In Title' is added as well. The vintage filter is modified to add a temporal bias for the past.

Commit:fab36d6
Author:Viktor Lofgren
Committer:Viktor Lofgren

(converter) Loader for reddit data
Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.
Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more.
Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these cannot be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run.
The change also refactors the sideloading a bit since it was a bit messy.

Commit:e8de468
Author:Viktor
Committer:GitHub

Make executor API talk GRPC (#75)
* (executor-api) Make executor API talk GRPC
The executor's REST API was very fragile and annoying to work with, lacking even basic type safety. Migrate to use GRPC instead. GRPC is a bit of a pain with how verbose it is, but that is probably a lesser evil. This is a fairly straightforward change, but it's also large so a solid round of testing is needed...
The change set breaks out the GrpcStubPool previously residing in the QueryService, and makes it available to all clients. ServiceId.name was also renamed to avoid the very dangerous clash with Enum.name().
The boilerplate needed for grpc was also extracted into a common gradle file for inclusion into the appropriate build.gradle-files.

Commit:edc1acb
Author:Viktor Lofgren

(*) Replace EC_DOMAIN_LINK table with files and in-memory caching
The EC_DOMAIN_LINK MariaDB table stores links between domains. This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB). This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need.
This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32-bit integer pairs corresponding to links between two domains. This file is loaded in memory in each node, and can be queried via the Query Service.
A migration step is needed before this file is created in each node. Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file.
The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.
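
A file of 32-bit integer pairs is simple enough to sketch. The example below encodes and decodes such pairs in memory; the class name, the (source, destination) ordering, the big-endian byte order and the linear scan are all assumptions for illustration, and the project's actual file layout and lookup structures may differ. Reading and writing the bytes to disk is left out to keep the sketch short.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: layout and names are assumptions, not the project's format.
final class DomainLinkSketch {
    /** Encodes each link as two consecutive 32-bit integers: source id, then destination id. */
    static byte[] encode(int[][] links) {
        ByteBuffer buf = ByteBuffer.allocate(links.length * 2 * Integer.BYTES);
        for (int[] link : links) {
            buf.putInt(link[0]);
            buf.putInt(link[1]);
        }
        return buf.array();
    }

    /** Decodes the bytes into a flat in-memory array: [src0, dest0, src1, dest1, ...]. */
    static int[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int[] flat = new int[bytes.length / Integer.BYTES];
        buf.asIntBuffer().get(flat);
        return flat;
    }

    /** Counts outgoing links from one domain using a simple linear scan. */
    static int countLinksFrom(int[] flat, int sourceId) {
        int count = 0;
        for (int i = 0; i + 1 < flat.length; i += 2) {
            if (flat[i] == sourceId) {
                count++;
            }
        }
        return count;
    }
}
```

Each link costs 8 bytes in this layout, which hints at why a flat file held in memory can be a comfortable replacement for a database table when ACID guarantees are not needed.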

Commit:4763077
Author:Viktor Lofgren

(search/index) Add a new keyword "count"
This is for filtering results on how many times the term appears on the domain. The intent is to be beneficial in creating e.g. a domain search feature. It's also very helpful when tracking down spammy domains.
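
One way to picture a per-domain count filter is sketched below. It assumes the input is a list with one domain name per matching document and keeps the domains that reach a minimum count; the class and method names are hypothetical, and the real keyword is evaluated inside the index rather than over a list of strings.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: input shape and names are assumptions for illustration.
final class DomainCountFilterSketch {
    /** Returns the domains (sorted) that have at least minCount matching documents. */
    static List<String> domainsWithAtLeast(List<String> hitDomains, long minCount) {
        // Tally how many matching documents each domain contributed.
        Map<String, Long> counts = hitDomains.stream()
                .collect(Collectors.groupingBy(d -> d, Collectors.counting()));
        // Keep only the domains whose tally reaches the threshold.
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= minCount)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

A high threshold in such a filter would surface domains that mention a term unusually often, which matches the spam-hunting use mentioned in the message.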

Commit:6bac3c7
Author:Viktor Lofgren

(api) API documentation

Commit:c2b28c0
Author:Viktor Lofgren

(api) Trial streaming API

Commit:a860f8f
Author:Viktor Lofgren
Committer:Viktor Lofgren

(index/qs) GRPC API for better query performance

Commit:b4051c3
Author:Viktor Lofgren

Remove old unused protobuf crap

This commit does not contain any .proto files.

Commit:6b44786
Author:Viktor Lofgren

2022-11 release (#133)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Co-authored-by: vlofgren <vlofgren@marginalia.nu>
Co-authored-by: Viktor Lofgren <vlofgren@marginalia.nu>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/133

Commit:6d33c38
Author:Viktor Lofgren

Merge changes from experimental branch (#132)
Co-authored-by: Viktor Lofgren <vlofgren@marginalia.nu>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/132

Commit:0a35a7c
Author:Viktor Lofgren

master (#119)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/119

Commit:df49ccb
Author:Viktor Lofgren

October Release (#118)
Co-authored-by: vlofgren <vlofgren@gmail.com>
Co-authored-by: vlofgren <vlofgren@marginalia.nu>
Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/118

Commit:3200c36
Author:vlofgren

Experimental changes for 22-08/09 update.