The Protocol Buffers files changed in the following 62 commits:
Commit: | fc13884 | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(actor) Add the ability to filter sample data based on content type This will help in extracting relevant test sets for PDF processing.
The documentation is generated from this commit.
Commit: | cfd4712 | |
---|---|---|
Author: | Viktor Lofgren |
(favicon) Add capability for fetching favicons
Commit: | a84a069 | |
---|---|---|
Author: | Viktor Lofgren |
(ranking-params) Add disable penalties flag to ranking params This will help debugging ranking issues. Later it may be added to some filters.
Commit: | 9be477d | |
---|---|---|
Author: | Viktor Lofgren |
(domain-info) Add a feed flag to domain info This is a bit of a sketchy solution that requires both assistant services to run on the same physical machine.
Commit: | 94d4d2e | |
---|---|---|
Author: | Viktor Lofgren |
(live-crawler) Add refresh date to feeds API For now this is just the ctime for the feeds db. We may want to store this per-record in the future.
Commit: | d4bce13 | |
---|---|---|
Author: | Viktor Lofgren |
(export) Add export actors to precession Adding a tracking message to the export actor means it's possible to run them in a precession. Adding a new precession actor, and some GUI components for triggering exports. The change also adds a heartbeat to the export process.
Commit: | 47dfbac | |
---|---|---|
Author: | Viktor Lofgren |
(conf) Introduce a new concept of node profiles Node profiles decide which actors are started, and which views are available in the control GUI. This helps keep the system organized, and hides real-time clutter from the batch-oriented nodes.
Commit: | c728a1e | |
---|---|---|
Author: | Viktor Lofgren |
(rss) Add endpoint for extracting URLs changed within a timespan.
Commit: | d874d76 | |
---|---|---|
Author: | Viktor Lofgren |
(rss) Add an endpoint that can be used for identifying when RSS data has changed
Commit: | a2bc9a9 | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished
Commit: | e24a983 | |
---|---|---|
Author: | Viktor Lofgren |
(feed) Update API to allow specifying clean vs refresh update Move the logic deciding which operation to perform into the actor, updating its state graph to incorporate a counter that runs a clean update once in a blue moon.
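One way to read the counter described in this commit: the actor counts its runs and asks for a clean update on every N-th run, and a refresh otherwise. The following is a minimal sketch under that assumption; `FeedUpdateScheduler`, `UpdateMode` and `cleanInterval` are hypothetical names and are not taken from the actual actor code.

```java
// Hypothetical sketch: names and the interval are illustrative assumptions,
// not the actual state graph of the actor.
class FeedUpdateScheduler {
    enum UpdateMode { CLEAN, REFRESH }

    private final int cleanInterval; // request a clean update on every N-th run
    private int runCount = 0;

    FeedUpdateScheduler(int cleanInterval) {
        if (cleanInterval < 1) {
            throw new IllegalArgumentException("cleanInterval must be >= 1");
        }
        this.cleanInterval = cleanInterval;
    }

    /** Decide which update the next run should perform, and advance the counter. */
    UpdateMode nextMode() {
        runCount++;
        if (runCount >= cleanInterval) {
            runCount = 0; // start counting towards the next clean update
            return UpdateMode.CLEAN;
        }
        return UpdateMode.REFRESH;
    }
}
```

Keeping the decision in the caller means the service API only needs to carry the chosen mode, which matches the commit's stated intent of moving the logic into the actor.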
Commit: | bfeb9a4 | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(feeds) Retire feedlot the feed bot, move RSS capture into the live-capture service
Commit: | d84a2c1 | |
---|---|---|
Author: | Viktor Lofgren |
(*) Remove the crawl spec abstraction The crawl spec abstraction was used to upload lists of domains into the system for future crawling. This was fairly clunky, and it was difficult to understand what was going to be crawled. Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table. This is much preferred and means the operator can directly manage domains without specs. This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
Commit: | 23cce0c | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
Add a new function 'Live Capture' for on-demand screenshot capture The screenshots are requested by the site-service, and triggered via the site-info view.
Commit: | 73f973c | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(search-query) Add pagination to search query API and the direct query-service interface
Commit: | 9aa8f13 | |
---|---|---|
Author: | Viktor Lofgren |
(index) Remove tcfAvgDist ranking parameter This is captured by tcfProximity already
Commit: | 0999f07 | |
---|---|---|
Author: | Viktor Lofgren |
(search-query) Add new ranking parameters for proximity and verbatim matches
Commit: | 03d5dec | |
---|---|---|
Author: | Viktor Lofgren |
(*) Refactor termCoherences and rename them to phrase constraints.
Commit: | 2e89b55 | |
---|---|---|
Author: | Viktor Lofgren |
(wip) Repair qdebug utility and show new ranking details
Commit: | aebb265 | |
---|---|---|
Author: | Viktor Lofgren |
(wip) Extract and encode spans data Refactoring keyword extraction to extract spans information. Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions. This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.
Commit: | dfd19b5 | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(index) Reduce the number of abstractions around result ranking The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous order was driving complexity in the code that generates these objects, and the consumer side of things puts all this data in the same object regardless.
Commit: | 8ed5b51 | |
---|---|---|
Author: | Viktor | |
Committer: | GitHub |
Merge branch 'master' into term-positions
Commit: | ad38579 | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(search-api, ranking) Update with new ranking parameters Adding new ranking parameters to the API and routing them through the system, in order to permit integration of the new position data with the ranking algorithm. The change also cleans out several parameters that no longer filled any function.
Commit: | d86926b | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(crawl) Add new functionality for re-crawling a single domain
Commit: | 9d00243 | |
---|---|---|
Author: | Viktor Lofgren |
(index) Partial re-implementation of position constraints
Commit: | eb74d08 | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(qs) Additional info in query debug UI
Commit: | 155be10 | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(index) Fix priority search terms This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
Commit: | 462aa9a | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
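The "defaults don't need to be sent over the wire" idea can be sketched independently of protobuf: the sender emits nothing when the parameters equal the defaults, and the receiver treats an absent message as the defaults. This is a minimal sketch only; the class name, field names and default values below are placeholders and are not the project's actual ranking parameters.

```java
import java.util.Objects;
import java.util.Optional;

// Hypothetical sketch: field names and default values are placeholders.
final class RankingParamsSketch {
    static final RankingParamsSketch DEFAULTS = new RankingParamsSketch(1.0, 1.0);

    final double ngramWeight;
    final double jaccardWeight;

    RankingParamsSketch(double ngramWeight, double jaccardWeight) {
        this.ngramWeight = ngramWeight;
        this.jaccardWeight = jaccardWeight;
    }

    /** Sender side: there is nothing to encode when the values are the defaults. */
    Optional<RankingParamsSketch> toWire() {
        return equals(DEFAULTS) ? Optional.empty() : Optional.of(this);
    }

    /** Receiver side: an absent message means "use the defaults". */
    static RankingParamsSketch fromWire(Optional<RankingParamsSketch> wire) {
        return wire.orElse(DEFAULTS);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof RankingParamsSketch)) return false;
        RankingParamsSketch other = (RankingParamsSketch) o;
        return Double.compare(ngramWeight, other.ngramWeight) == 0
                && Double.compare(jaccardWeight, other.jaccardWeight) == 0;
    }

    @Override
    public int hashCode() {
        return Objects.hash(ngramWeight, jaccardWeight);
    }
}
```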
Commit: | 6efc0f2 | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(index) Clean up data model The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality. The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.
Commit: | b80a833 | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(qs) Additional info in query debug UI
Commit: | a3a6d62 | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(qs, index) New query model integrated with index service. Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
Commit: | 4fb86ac | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(search) Fix outdated assumptions about the results We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption. For the API service, we'll simulate the old behavior to keep the API stable. For the search service, we'll introduce a new way of calculating positions through tree aggregation.
Commit: | 212d101 | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(control) GUI for exporting segmentation data from a wikipedia zim
Commit: | 9b06433 | |
---|---|---|
Author: | Viktor Lofgren |
(qs) Additional info in query debug UI
Commit: | def607d | |
---|---|---|
Author: | Viktor Lofgren |
(qs) Additional info in query debug UI
Commit: | 7641a02 | |
---|---|---|
Author: | Viktor Lofgren |
(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
Commit: | 599e719 | |
---|---|---|
Author: | Viktor Lofgren |
(index) Fix priority search terms This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
Commit: | b6d365b | |
---|---|---|
Author: | Viktor Lofgren |
(index) Clean up data model The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality. The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.
Commit: | fcdc843 | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(search) Fix outdated assumptions about the results We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption. For the API service, we'll simulate the old behavior to keep the API stable. For the search service, we'll introduce a new way of calculating positions through tree aggregation.
Commit: | 81815f3 | |
---|---|---|
Author: | Viktor Lofgren |
(qs, index) New query model integrated with index service. Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
Commit: | afc047c | |
---|---|---|
Author: | Viktor Lofgren |
(control) GUI for exporting segmentation data from a wikipedia zim
Commit: | 4642361 | |
---|---|---|
Author: | Viktor Lofgren |
(refac) Merge service-discovery and service modules Also adds a few tests to the server/client code.
Commit: | 9f16496 | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
Clean up documentation and rename `domain-links` to `link-graph`
Commit: | 427f3e9 | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(index) Retire count operation, clean up index code.
Commit: | 56d35aa | |
---|---|---|
Author: | Viktor Lofgren |
(refac) Move execution API out of executor service
Commit: | 3fd2a83 | |
---|---|---|
Author: | Viktor Lofgren |
* Extract the search-query function
Commit: | 66c1281 | |
---|---|---|
Author: | Viktor Lofgren |
(zk-registry) epic yak shaving WIP

Cleaning out a lot of old junk from the code, and one thing led to another...

* Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds.
* The ProcessService's spawning is smarter. Will now just spawn a java process instead of relying on the application plugin's generated outputs.
* Project is migrated to GraalVM
* gRPC clients are re-written with a neat fluent/functional style, e.g.
```
channelPool.call(grpcStub::method)
    .async(executor) // <-- optional
    .run(argument);
```
  This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall.
* For now the project is all in on zookeeper
* Service discovery is now based on APIs and not services. Theoretically means we could ship the same code as either a monolith or a service mesh.
* To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service.

WIP! Missing is documentation and testing, and some more breaking apart of code.
Commit: | 0307c55 | |
---|---|---|
Author: | Viktor Lofgren |
(refac) Zookeeper for service-discovery, kill service-client lib (WIP) To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added. A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything. The last remaining REST service, the assistant-service, has been migrated to gRPC. This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels. Since it's no longer used by anything, RxJava has been removed as a dependency from the project. Although the current state seems reasonably stable, this is a work-in-progress commit.
Commit: | 66b3e71 | |
---|---|---|
Author: | Viktor Lofgren |
(search) Expose more search options This change set updates the query APIs to enable the search service to add additional criteria, such as QueryStrategy and TemporalBias. The QueryStrategy makes it possible to e.g. require a match is in the title of a result, and TemporalBias enables penalizing results that are not within a particular time period. These options are added to the search interface. The old 'recent results' is modified to use TemporalBias, and a new filter 'Search In Title' is added as well. The vintage filter is modified to add a temporal bias for the past.
Commit: | fab36d6 | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(converter) Loader for reddit data Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult. Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more. Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these can not be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run. The change also refactors the sideloading a bit since it was a bit messy.
Commit: | e8de468 | |
---|---|---|
Author: | Viktor | |
Committer: | GitHub |
Make executor API talk GRPC (#75) * (executor-api) Make executor API talk GRPC The executor's REST API was very fragile and annoying to work with, lacking even basic type safety. Migrate to use GRPC instead. GRPC is a bit of a pain with how verbose it is, but that is probably a lesser evil. This is a fairly straightforward change, but it's also large so a solid round of testing is needed... The change set breaks out the GrpcStubPool previously residing in the QueryService, and makes it available to all clients. ServiceId.name was also renamed to avoid the very dangerous clash with Enum.name(). The boilerplate needed for grpc was also extracted into a common gradle file for inclusion into the appropriate build.gradle-files.
Commit: | edc1acb | |
---|---|---|
Author: | Viktor Lofgren |
(*) Replace EC_DOMAIN_LINK table with files and in-memory caching The EC_DOMAIN_LINK MariaDB table stores links between domains. This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB). This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need. This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains. This file is loaded in memory in each node, and can be queried via the Query Service. A migration step is needed before this file is created in each node. Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file. The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.
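A file of 32 bit integer pairs that is loaded into memory and queried by source domain can be sketched as follows. This is only an illustration of the general technique; the byte order, the in-memory map, and the names `DomainLinkFile`, `write` and `load` are assumptions and do not describe the actual file format or classes.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the on-disk layout of the real file is an assumption here.
class DomainLinkFile {
    /** Write (source, dest) domain id pairs as consecutive big-endian 32 bit integers. */
    static void write(Path file, int[][] links) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (int[] link : links) {
                out.writeInt(link[0]); // source domain id
                out.writeInt(link[1]); // destination domain id
            }
        }
    }

    /** Load the whole file into memory as a source -> destinations map. */
    static Map<Integer, List<Integer>> load(Path file) throws IOException {
        Map<Integer, List<Integer>> bySource = new HashMap<>();
        long pairs = Files.size(file) / 8; // two 4-byte integers per pair
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            for (long i = 0; i < pairs; i++) {
                int source = in.readInt();
                int dest = in.readInt();
                bySource.computeIfAbsent(source, k -> new ArrayList<>()).add(dest);
            }
        }
        return bySource;
    }
}
```

Each pair costs 8 bytes on disk. A production implementation would likely use primitive arrays rather than boxed collections to keep the memory footprint down.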
Commit: | 4763077 | |
---|---|---|
Author: | Viktor Lofgren |
(search/index) Add a new keyword "count" This is for filtering results on how many times the term appears on the domain. The intent is to be beneficial in creating e.g. a domain search feature. It's also very helpful when tracking down spammy domains.
Commit: | 6bac3c7 | |
---|---|---|
Author: | Viktor Lofgren |
(api) API documentation
Commit: | c2b28c0 | |
---|---|---|
Author: | Viktor Lofgren |
(api) Trial streaming API
Commit: | a860f8f | |
---|---|---|
Author: | Viktor Lofgren | |
Committer: | Viktor Lofgren |
(index/qs) GRPC API for better query performance
Commit: | b4051c3 | |
---|---|---|
Author: | Viktor Lofgren |
Remove old unused protobuf crap
This commit does not contain any .proto files.
Commit: | 6b44786 | |
---|---|---|
Author: | Viktor Lofgren |
2022-11 release (#133) Co-authored-by: vlofgren <vlofgren@gmail.com> Co-authored-by: vlofgren <vlofgren@marginalia.nu> Co-authored-by: Viktor Lofgren <vlofgren@marginalia.nu> Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/133
Commit: | 6d33c38 | |
---|---|---|
Author: | Viktor Lofgren |
Merge changes from experimental branch (#132) Co-authored-by: Viktor Lofgren <vlofgren@marginalia.nu> Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/132
Commit: | 0a35a7c | |
---|---|---|
Author: | Viktor Lofgren |
master (#119) Co-authored-by: vlofgren <vlofgren@gmail.com> Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/119
Commit: | df49ccb | |
---|---|---|
Author: | Viktor Lofgren |
October Release (#118) Co-authored-by: vlofgren <vlofgren@gmail.com> Co-authored-by: vlofgren <vlofgren@marginalia.nu> Reviewed-on: https://git.marginalia.nu/marginalia/marginalia.nu/pulls/118
Commit: | 3200c36 | |
---|---|---|
Author: | vlofgren |
Experimental changes for 22-08/09 update.