NextId: 11
Used in:
Number of times we perform single-linkage clustering. If num_iterations = 0, produces a clustering in which each node is in its own cluster. Note that setting this to k is equivalent to setting the compression_rounds parameter of distributed affinity clustering to k-1. In sequential affinity clustering, the number of levels in the clustering hierarchy produced is always equal to num_iterations. However, in parallel affinity clustering, the number of levels may be fewer than the number of iterations, in which case the omitted clustering levels are equal to the last returned level.
Specifies the edge weight threshold in each iteration of the clustering. In each iteration, edges of weight smaller than the threshold are ignored.
A fixed threshold in each round.
A fixed sequence of thresholds. NOTE: If num_iterations > the length of the list, then the last threshold specified is used for all iterations beyond the length of the list (see the sketch following these options).
Dynamically change the weight threshold in each iteration.
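A minimal sketch of the fixed-sequence semantics above (a hypothetical standalone helper, not part of the library):

#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the threshold for a given 0-based iteration; iterations past the
// end of the list reuse the last listed threshold. Assumes a non-empty list.
double ThresholdForIteration(const std::vector<double>& thresholds,
                             std::size_t iteration) {
  return thresholds[std::min(iteration, thresholds.size() - 1)];
}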
Possible ways in which a cluster may qualify as "active". A cluster is active if it satisfies at least one of the conditions listed in this field. If the field is empty, every cluster is active.
The percentile value (from 0 to 1) of edge weight distributions to use when determining which clusters to merge. Only used for PERCENTILE EdgeAggregationFunction.
The minimum number of edges required between two clusters for them to possibly be merged via the PERCENTILE EdgeAggregationFunction. If the minimum number of edges is not present, then the MAX EdgeAggregationFunction is used.
If true, use node weights in input to calculate cluster sizes. Otherwise, use node counts to calculate cluster sizes (i.e., use default node weight of 1).
Specifies a set of conditions that qualify a cluster as "active". An unset field defines a condition that is always satisfied.
Used in:
Minimum density, that is, total edge weight divided by the number of (unordered) node pairs.
Sum of weights of edges leaving the cluster divided by the total weight of edges incident to all nodes in the cluster.
Specifies how edge weights are aggregated when computing a compressed graph for subsequent iterations. Let S = the set of edge weights between two clusters and X, Y = the total number of nodes in each cluster. With these definitions, we use the following formulas (a small sketch follows the list below):
Used in:
sum(S) / (X*Y)
max(S)
sum(S)
sum(S) / min(X, Y)
Let s_0, ..., s_{|S|-1} be a nondecreasing ordering of S. Then, pick edge weight s_N such that N is percentile_linkage_value * (|S|-1).
sum(S) / count(S)
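The formulas above can be collected into a small standalone sketch (the struct and function are illustrative, not the library's API; assumes S is non-empty):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

struct Aggregates {
  double default_average, max, sum, cut_sparsity, percentile, average;
};

// s = inter-cluster edge weights S; x, y = node counts X, Y of the two
// clusters; p = percentile_linkage_value in [0, 1].
Aggregates Aggregate(std::vector<double> s, double x, double y, double p) {
  Aggregates a;
  a.sum = std::accumulate(s.begin(), s.end(), 0.0);    // sum(S)
  a.default_average = a.sum / (x * y);                 // sum(S) / (X*Y)
  a.max = *std::max_element(s.begin(), s.end());       // max(S)
  a.cut_sparsity = a.sum / std::min(x, y);             // sum(S) / min(X, Y)
  std::sort(s.begin(), s.end());                       // nondecreasing s_0..s_{|S|-1}
  // Pick s_N such that N is percentile_linkage_value * (|S|-1).
  a.percentile = s[static_cast<std::size_t>(p * (s.size() - 1))];
  a.average = a.sum / s.size();                        // sum(S) / count(S)
  return a;
}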
Specifies size constraints.
Used in:
Enforces a hard constraint on the maximum total weight of nodes in a cluster. Note that by default each node has a weight of 1. Set `use_node_weight_for_cluster_size` to true in order to use input node weights to calculate cluster sizes.
Desired minimum total weight of nodes in a cluster. Note that by default each node has a weight of 1. Set `use_node_weight_for_cluster_size` to true in order to use input node weights to calculate cluster sizes. NOTE: This is a soft constraint, and returned clusters may be smaller than the desired minimum if:
- the graph contains connected components with size less than the minimum.
- the configuration prevents sufficient merging (due to not enough compression rounds or too high an edge weight threshold).
- merging clusters would violate the maximum size constraint, if present.
prefer_min_cluster_size is effective only when min_cluster_size is set. When min_cluster_size is set, we do *not* try to maximize cluster sizes. The behavior for honoring min_cluster_size depends on prefer_min_cluster_size. If false (the default), a cluster does not initiate a merge with another cluster if its size is above the minimum size constraint upon entry to a compression round. If true, clusters merge only if at least one of them is below min_cluster_size throughout a compression round. This tends the result towards clusters near min_cluster_size.
Desired target total weight of nodes in a cluster. Note that by default each node has a weight of 1. NOTE: This provides (1) a soft constraint for the minimum size and (2) a hard constraint for the maximum size. For the minimum size constraint, returned clusters may be smaller than the desired minimum if:
- the graph contains connected components with total weight less than the minimum.
- the configuration prevents sufficient merging (due to not enough compression rounds or too high an edge weight threshold).
For the maximum size constraint, returned clusters can have weight at most 2 * target_cluster_size * compression_rounds + maximum_node_weight. In particular, if an affinity tree in a round has total weight at least target_cluster_size, then the affinity tree will be partitioned into clusters, each of total weight at least target_cluster_size, such that each cluster
- has total weight at most 4 * target_cluster_size, OR
- has total weight at most 2 * target_cluster_size after removing the node in the current affinity tree with the largest weight.
Note: target_cluster_size is used to handle affinity trees after all of max_cluster_size, min_cluster_size, prefer_min_cluster_size and merge_allowance_config are processed.
Used in:
Used in:
The minimum density of the clique containers to be returned.
Config for InMemoryClusterer subclasses. When adding a new subclass: a) add a new proto definition to this file, b) add this proto to the config oneof in ClustererConfig. Next available tag: 20
Used in:
Use this to pass parameters to experimental partitioners and partitioners defined in unittests.
Specification of an in-memory clustering step, consisting of the clusterer name and its config.
Config for CoconductanceClusterer, which optimizes co-conductance, a clustering quality measure defined as follows. Given an undirected graph with weighted edges, for a cluster C define:
* vol(C) = total weighted degree of nodes within C
* E(C) = total weight of all undirected edges with both endpoints in C
* ccond(C) = 2E(C) / vol(C)
Then, given an exponent p > 0, we define the co-conductance of the clustering C_1, ..., C_k to be sum ccond(C_i)^p for i = 1, ..., k. This seems to be a reasonable method for finding dense clusters, even in the absence of weights. See go/coconductance-paper for details. The algorithm optimizes co-conductance using the Louvain algorithm.
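A minimal sketch of this objective, assuming per-cluster volumes vol(C_i) and intra-cluster edge weights E(C_i) have already been computed (illustrative, not the clusterer's implementation):

#include <cmath>
#include <cstddef>
#include <vector>

// volume[i] = vol(C_i); intra_edge_weight[i] = E(C_i); p = exponent.
double CoconductanceObjective(const std::vector<double>& volume,
                              const std::vector<double>& intra_edge_weight,
                              double p) {
  double objective = 0.0;
  for (std::size_t i = 0; i < volume.size(); ++i) {
    const double ccond = 2.0 * intra_edge_weight[i] / volume[i];  // ccond(C_i)
    objective += std::pow(ccond, p);  // sum of ccond(C_i)^p
  }
  return objective;
}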
Used in:
Deprecated. Use louvain.exponent() instead.
Which algorithm to use to optimize for co-conductance. By default we use Louvain.
Use the Louvain method, adapted to the co-conductance setting.
Use the constant-approximate algorithm from go/coconductance-paper. This is a theoretical algorithm available for benchmarking purposes. It is not recommended to use it in practice.
Used in:
Used in:
Exponent used in the objective formula. Increasing the exponent results in a smaller number of clusters. The reasonable range of exponents seems to be [0.1, 10]. Note that the objective is calculated using double type, so for large exponents (> 100) the algorithm may suffer from floating-point precision issues.
Consider a graph with vertex set V, edge set E, non-negative vertex weights k_u, edge weights w_uv, and a "resolution" parameter which must be non-negative. We define "rescaled" edge weights w'_{uv} for all u, v in V as:

            { 0                                               if u == v,
  w'_{uv} = { w_uv - edge_weight_offset - resolution k_u k_v  if {u, v} in E,
            { -resolution k_u k_v                             otherwise.

Note that when using graph_mining.Node protos to store the graph, the edge and node weights default to 1. For bipartite objective computation, see the comments preceding `use_bipartite_objective`. The correlation clustering objective is to maximize sum_{u, v in the same cluster} w'_{uv}, which is equivalent (up to sign and an additive constant) to the "maximizing agreements" and "minimizing disagreements" formulations of correlation clustering that are used in the approximation algorithms literature. Assuming the total edge weight in the graph is M, modularity partitioning can be expressed in this form by:
* setting resolution = 1/(2*M),
* setting the weight of each node to the total weight of its incident edges.
Note that the final correlation clustering objective is monotonic in, but not equal to, modularity. In particular, if the correlation clustering objective is C, we have: modularity = (C - resolution * sum_v (deg_v)^2 / (4 * M)) / M. This special case is available as ModularityClusterer in this library.

To optimize this objective we use local search. We start with each vertex in its own cluster. We consider moves of the following form: move all vertices in a "move set" S of vertices to either one of the existing clusters or to a newly created cluster. We currently consider the following options for S:
- Each vertex in a singleton move set. This reduces to the classic single-vertex moves.
- One move set per current cluster, containing all the vertices currently in it. With these move sets we are effectively considering merging two clusters.
- One move set per cluster from single-level affinity clustering.
- One move set per cluster from a run of the Ailon Charikar Newman approximation algorithm. The clusters produced by three runs of the Ailon Charikar Newman algorithm (it is randomized) are used as move sets in each iteration.
The local search proceeds with effectively three nested loops: the outermost loop is over the num_iterations iterations, the middle loop is over the four move types listed above, and the inner loop is over move sets of the particular type. For each move set considered, we move that move set to the cluster that improves the objective the most, if an improving move exists. Next available tag: 15
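A standalone sketch of the rescaled edge weight w'_{uv} defined above (an illustrative helper, not the library's implementation):

#include <cstdint>

// weight = w_uv (meaningful only when has_edge is true); node_weight_u/v
// are the vertex weights k_u and k_v.
double RescaledWeight(int64_t u, int64_t v, bool has_edge, double weight,
                      double node_weight_u, double node_weight_v,
                      double edge_weight_offset, double resolution) {
  if (u == v) return 0.0;
  const double penalty = resolution * node_weight_u * node_weight_v;
  if (has_edge) return weight - edge_weight_offset - penalty;  // {u, v} in E
  return -penalty;  // missing edge
}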
Used in:
Parameters used by both CorrelationClusterer and ParallelCorrelationClusterer. The next two fields control how the rescaled edge weights are calculated. See the comment above CorrelationClustererConfig.
Parameters only used by CorrelationClusterer. random_seed is no longer supported due to the migration to absl::BitGen and will return an error if set.
Number of local improvement iterations. Each iteration has runtime linear in the number of edges. By default, or if non-positive, a reasonable value is used, currently 10.
Enables the Affinity local move type.
The number of Ailon Charikar Newman clusterings to generate for local moves each iteration. Zero, the default, disables the ACN moves. The reasonable values are roughly [0, 3].
Parameters if Louvain is chosen for the clustering_moves_method.
Specifies whether correlation clustering should perform moves synchronously, which preserves consistency guarantees. Using the synchronous setting may produce a poorer objective compared to the asynchronous setting due to a lack of symmetry breaking, but it is deterministic assuming fixed vertex ids in the input (whereas the asynchronous setting is non-deterministic). The asynchronous setting is up to 2.50x faster (median of 1.21x) and gives a 1.29--156.01% increase in objective. See go/correlation-clustering-paper for details.
Specifies whether correlation clustering should perform multi-level refinement. Multi-level refinement may improve the objective (a 1.12--36.92% increase), but at the cost of speed (up to a 2.29x slowdown, with a median of 1.67x) and space (a 1.40--23.68x memory overhead over the size of the input graph, whereas without refinement the memory overhead is 1.25--3.24x). See go/correlation-clustering-paper for details.
Specifies whether to use extra space for temporarily holding new clusters. This should be used only with use_synchronous == false.
Specifies whether to use the bipartite correlation objective computation. When set to true, we adjust the correlation objective computation for a bipartite graph setting, where we do not penalize missing edges among nodes within the same graph partition. When the algorithm is run on a graph in go/graph-format, the "side" of a node is determined by the graph_mining.Node.part integer, which should be set to 0 or 1. Specifically, we have:

            { 0                                               if u == v,
  w'_{uv} = { w_uv - edge_weight_offset - resolution k_u k_v  if {u, v} in E,
            { 0                                               if {u, v} is not in E and
            {                                                   partition(u) == partition(v),
            { -resolution k_u k_v                             otherwise.
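A sketch of this bipartite variant, extending the illustrative RescaledWeight helper above (again not the library's implementation; part_u/part_v stand for graph_mining.Node.part):

#include <cstdint>

// Bipartite variant: a missing edge between nodes on the same side
// (same partition) contributes 0 instead of -resolution k_u k_v.
double BipartiteRescaledWeight(int64_t u, int64_t v, bool has_edge,
                               double weight, double node_weight_u,
                               double node_weight_v, double edge_weight_offset,
                               double resolution, int part_u, int part_v) {
  if (u == v) return 0.0;
  const double penalty = resolution * node_weight_u * node_weight_v;
  if (has_edge) return weight - edge_weight_offset - penalty;
  if (part_u == part_v) return 0.0;  // no penalty within the same partition
  return -penalty;
}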
Specifies the algorithm to use for correlation clustering.
Used in:
The default method is the cluster moves method for sequential correlation clustering (CorrelationClusterer), and the Louvain method for parallel correlation clustering (ParallelCorrelationClusterer). The Louvain method offers performance and quality improvements in the parallel setting.
This method performs the classic Louvain algorithm, where after rounds of best moves converge, the algorithm compresses clusters into nodes and then repeats this process on the compressed graph. The parameters for this algorithm are given in louvain_config.
This method involves alternating between single vertex best moves and entire cluster best moves. An iteration consists of one round of single vertex best moves and one round of entire cluster best moves. The number of iterations is as given in num_iterations.
Dynamic hierarchical agglomerative clustering (HAC) using the average linkage function.
This parameter controls the accuracy of the Hac algorithm. Specifically, the output dendrogram is (1+epsilon)-approximate.
Defines the stopping condition for producing the dendrogram maintained by the algorithm. Specifically, the algorithm merges clusters until no pair of clusters has similarity > `weight_threshold`. When accessing the dendrogram (via the `Dendrogram()` method), this is the lowest weight threshold for which the resulting dendrogram can be meaningfully flattened, i.e., flattening the dendrogram with a threshold lower than the value specified here gives a clustering with no approximation guarantees. The higher the value, the faster the algorithm.
Config for algorithm used for linear embedding of the input graph.
Used in:
Please see AffinityConfigWithDefaults function in affinity_hierarchy_embedder.cc for the default values used for a few unset fields. Also any EdgeAggregationFunction that relies on node weights (such as AVERAGE) is currently not supported.
TODO: Add SortingLSH and ParHAC embedding options.
Shared-memory implementation of parallel label propagation. The initial label of each node is its own id. In each round, each node that had a neighbor update its label in the previous round recomputes its label as the most frequent label among its neighbors. Edge weights are summed up to compute the max-frequency label. Note that the order in which nodes are updated depends on the options set in the config below.
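A simplified sequential sketch of one such round (the actual implementation is parallel and only recomputes nodes whose neighborhood changed; this recomputes every node for brevity):

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// graph[u] holds (neighbor, edge weight) pairs; label[u] is u's current
// label (initially u's own id). Returns the labels after one round.
std::vector<int64_t> LabelPropagationRound(
    const std::vector<std::vector<std::pair<int64_t, double>>>& graph,
    const std::vector<int64_t>& label) {
  std::vector<int64_t> next = label;
  for (std::size_t u = 0; u < graph.size(); ++u) {
    std::unordered_map<int64_t, double> frequency;
    // Edge weights are summed up to compute the max-frequency label.
    for (const auto& [v, w] : graph[u]) frequency[label[v]] += w;
    double best = -1.0;
    for (const auto& [l, w] : frequency) {
      if (w > best) { best = w; next[u] = l; }
    }
  }
  return next;
}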
Used in:
The maximum number of rounds to run.
Color the graph using greedy graph coloring, and then process each color set in parallel in sub-rounds within a round. Enabling this flag ensures that the algorithm is deterministic. TODO: benchmark the slowdown incurred by using coloring and check if we should set this option to true by default.
Config for the parallel line partitioner (go/parline).
Used in:
Required. Either the number of clusters or the desired (average) cluster weight must be specified. Value provided must be greater than zero.
The current implementation uses std::ceil(total_node_weight / cluster_weight) to convert this to the number of clusters.
Desired imbalance. Each cluster weight is targeted to be at most max((1 + imbalance) * W / k, W / k + max_{v in V} w(v)), where W is the total node weight, w(v) is the weight of node v, and k is the number of clusters.
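A small illustrative helper for this bound (not library code):

#include <algorithm>

// W = total node weight, k = number of clusters, max_node_weight =
// max_{v in V} w(v). Returns the target upper bound on a cluster's weight.
double MaxClusterWeight(double total_weight, int num_clusters,
                        double imbalance, double max_node_weight) {
  const double average = total_weight / num_clusters;  // W / k
  return std::max((1.0 + imbalance) * average, average + max_node_weight);
}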
Method used to embed nodes of a graph into a line (or circle).
If true then a cluster weight is computed as the sum of node weights in it (otherwise it is the number of nodes).
Post-processing local search done on clusters to improve quality.
Used in:
This config is for clustering using the Louvain algorithm, where the objective is given by another config.
Used in:
Max number of rounds (of best moves and compression) to run.
Number of best moves rounds to run. This is primarily for parallel Louvain, which may not terminate otherwise.
Max number of iterations. An iteration in this context is defined as one round of location improvement for all the nodes in the graph. See kDefaultMaxIterations in minla.cc for the default value used if not specified.
Convergence delta. Used as a termination criterion if the cost difference between two subsequent iterations stays within this delta. Note that this cost is measured in terms of the intermediate placements computed by the iterative median/mean algorithm and not the integer locations of the implied linear arrangement.
Used in:
When unspecified, the default is L1_COST_METRIC.
Weighted L1 cost metric, defined as: \sum w_ij * abs(l_i - l_j), where w_ij is the edge weight between node i and node j, and l_i is the location of node i in the range [0, n). NOTE: Both (i,j) and (j,i) are included in the sum.
Weighted L2 cost metric, defined as: \sum w_ij * (l_i - l_j)^2, where w_ij is the edge weight between node i and node j, and l_i is the location of node i in the range [0, n). NOTE: Both (i,j) and (j,i) are included in the sum.
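Both metrics can be computed with a short standalone sketch (illustrative types, not the library's API; `edges` must contain both (i,j) and (j,i), matching the NOTE above):

#include <cmath>
#include <cstdint>
#include <vector>

struct WeightedEdge { int64_t i, j; double w; };

// location[i] is l_i in [0, n); use_l2 selects the L2 metric.
double ArrangementCost(const std::vector<WeightedEdge>& edges,
                       const std::vector<int64_t>& location, bool use_l2) {
  double cost = 0.0;
  for (const auto& e : edges) {
    const double d =
        std::abs(static_cast<double>(location[e.i] - location[e.j]));
    cost += e.w * (use_l2 ? d * d : d);  // w_ij*(l_i-l_j)^2 or w_ij*|l_i-l_j|
  }
  return cost;
}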
Used in:
This parameter controls the weight of the null model in modularity. If it is higher, "more of" the null model is subtracted from each edge, so links are "less important", and communities break up, leading to smaller clusters. Semantically, "higher resolutions" = "smaller clusters". Reasonable settings for the parameter depend on the application, ranging from 0.5 to 3.0. If possible, tune with Metaclusterer. Note that resolution = 0.0 will recover connected components, while there always exists a large enough resolution such that all communities are singletons. For some details:
- https://arxiv.org/pdf/cond-mat/0603718.pdf - introduction of the parameter
- https://arxiv.org/pdf/1107.1155.pdf - analysis of the parameter
The current implementation of ModularityClusterer uses CorrelationClusterer under the hood. The resolution and edge_weight_offset fields of the provided correlation config will be overridden to achieve the modularity objective. Other fields will be passed as provided.
Used in:
Number of improvement iterations where an iteration is defined as performing local search among all the paired clusters.
Cluster pairing method used when two clusters are chosen to swap nodes.
Used in:
Required when pairing name is set to DISTANCE and must be at least 1.
Used in:
See the OddEvenPairingScheme function in pairing_scheme.h.
See the DistancePairingScheme function in pairing_scheme.h.
Parallel hierarchical agglomerative clustering (HAC) using the average linkage function. The algorithm that we use is described in go/parhac-paper.
Used in:
This parameter controls the accuracy of the ParHac algorithm. Specifically, the ParHac algorithm computes a (1+epsilon)^2-approximate dendrogram for the average-linkage measure (UPGMA-linkage). See https://en.wikipedia.org/wiki/Hierarchical_clustering for a high-level overview of the sequential HAC algorithm, and https://en.wikipedia.org/wiki/UPGMA for the average-linkage measure. In more detail, an alpha-approximate dendrogram ensures that if an edge (u, v, W_{uv}) is merged, then W_{max}/alpha <= W_{uv}, where W_{max} is the maximum similarity in the graph at the time of the (u, v) merge. Given a user-specified value of epsilon, the ParHac algorithm guarantees that alpha = (1+epsilon)^2. Note that if epsilon is not specified, we use a default value of 0.1.
This parameter is used to *cut* the approximate dendrogram that's obtained by running the ParHac algorithm. We use the GetSubtreeClustering algorithm from parallel-dendrogram.h to perform the cut. In the case where the leaf-to-root paths in the dendrogram are monotonically decreasing, we will return a clustering induced by the connected components of the dendrogram where all edges with weight below the weight threshold are removed. Please see documentation in parallel-dendrogram.h for details when the leaf-to-root paths are not monotonically decreasing, which may occur in the ParHac algorithm.
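A simplified sketch of such a cut, assuming the leaf-to-root merge weights are monotonically decreasing (the library's GetSubtreeClustering in parallel-dendrogram.h handles the general case):

#include <cstdint>
#include <numeric>
#include <vector>

struct Merge { int64_t u, v; double weight; };

// Union-find with path halving.
int64_t Find(std::vector<int64_t>& parent, int64_t x) {
  while (parent[x] != x) x = parent[x] = parent[parent[x]];
  return x;
}

// Treats each merge as an edge and returns the connected components formed
// by merges with weight >= threshold; cluster_id[i] identifies i's cluster.
std::vector<int64_t> CutDendrogram(const std::vector<Merge>& merges,
                                   int64_t num_nodes, double threshold) {
  std::vector<int64_t> parent(num_nodes);
  std::iota(parent.begin(), parent.end(), 0);
  for (const auto& m : merges) {
    if (m.weight >= threshold) parent[Find(parent, m.u)] = Find(parent, m.v);
  }
  std::vector<int64_t> cluster_id(num_nodes);
  for (int64_t i = 0; i < num_nodes; ++i) cluster_id[i] = Find(parent, i);
  return cluster_id;
}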
Shared-memory implementation of TeraHac: https://arxiv.org/abs/2308.03578
Used in:
This parameter controls the accuracy of the TeraHac algorithm. Specifically, TeraHac computes a (1+epsilon)-approximate dendrogram for average-linkage (see the ParHacConfig description for an explanation of c-approximate HAC). Note that if epsilon is not specified, we use a default value of 0.1.
This parameter is used to both *prune* edges while running the TeraHac algorithm, and to *cut* the final dendrogram that's obtained by running the TeraHac algorithm. We will return a clustering induced by the connected components of the dendrogram where all edges with weight below the weight threshold are removed.