package semantic_model_generator


AggregationType defines a list of various aggregations.

aggregation_type_unknown = 0
sum = 1
avg = 2
median = 7
min = 3
max = 4
count = 5
count_distinct = 6

Column is analogous to a database column and defines various semantic properties of a column. A column can either simply be a column in the base database schema or it can be an arbitrary expression over the base schema, e.g. `base_column1 + base_column2`.

Used in: Table

string name = 1
A descriptive name for this column.
repeated string synonyms = 2
A list of other terms/phrases used to refer to this column.
string description = 3
A brief description about this column, including things like what data this column has.
string expr = 4
The SQL expression for this column. Could simply be a base table column name or an arbitrary SQL expression over one or more columns of the base table.
string data_type = 5
The data type of this column. TODO(nsehrawat): Consider creating an enum instead, with all Snowflake-supported data types.
ColumnKind kind = 6
The kind of this column, e.g. dimension, fact, or metric.
bool unique = 7
If true, assume that this column has unique values.
AggregationType default_aggregation = 8
If no aggregation is specified, then this is the default aggregation applied to this column in the context of a grouping.
repeated string sample_values = 9
Sample values of this column.
bool index_and_retrieve_values = 10
Whether to index the values and retrieve them based on the question. If False, all sample values will be used as input to the model.
repeated RetrievalResult retrieved_literals = 11
Retrieved literals of this column.
string cortex_search_service_name = 12
A Cortex Search Service configured on this column to retrieve literals.
optional CortexSearchService cortex_search_service = 13
bool is_enum = 14
If true, this column has limited possible values, all of which are in the sample_values field.
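The interaction between `sample_values`, `index_and_retrieve_values`, and `retrieved_literals` can be sketched as follows. This is a minimal illustration, not the generator's actual code: the helper `literals_for_model` is hypothetical, and ranking retrieved literals by descending `score` is an assumption that the schema itself does not state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetrievalResult:
    # Mirrors the RetrievalResult message: a literal value and its score.
    value: str
    score: float


def literals_for_model(
    sample_values: List[str],
    retrieved_literals: List[RetrievalResult],
    index_and_retrieve_values: bool,
) -> List[str]:
    """Pick the literals passed to the model (hypothetical helper)."""
    if not index_and_retrieve_values:
        # Per the field description: all sample values are used as input.
        return list(sample_values)
    # Assumption: higher scores mean more relevant literals.
    ranked = sorted(retrieved_literals, key=lambda r: r.score, reverse=True)
    return [r.value for r in ranked]
```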

ColumnKind defines various kinds of columns, mainly categorized into dimensions and measures.

Used in: Column

column_kind_unknown = 0
dimension = 1
A column containing categorical values such as names, countries, dates.
measure = 2
A column containing numerical values such as revenue, impressions, salary. TODO: migrate to fact.
time_dimension = 3
A column containing date/time data.
metric = 4
A "column" containing calculations about an entity such as sum_revenue, cvr.

CortexSearchService is the fully qualified name of a Cortex Search Service.

Used in: Column, Dimension

string database = 1
string schema = 2
string service = 3
string literal_column = 4

Dimension columns contain categorical values (e.g. state, user_type, platform). NOTE: If modifying this protobuf, make appropriate changes in context_to_column_format() of snowpilot/semantic_context/protos/schema.py.

Used in: Table

string name = 1
A descriptive name for this dimension.
repeated string synonyms = 2
A list of other terms/phrases used to refer to this dimension.
string description = 3
A brief description about this dimension, including things like what data this dimension has.
string expr = 4
The SQL expression defining this dimension. Could simply be a physical column name or an arbitrary SQL expression over one or more columns of the physical table.
string data_type = 5
The data type of this dimension. TODO(nsehrawat): Consider creating an enum instead, with all Snowflake-supported data types.
bool unique = 6
If true, assume that this dimension has unique values.
repeated string sample_values = 7
Sample values of this dimension.
optional CortexSearchService cortex_search_service = 8
A Cortex Search Service configured on this column to retrieve literals.
string cortex_search_service_name = 9
bool is_enum = 10
If true, this column has limited possible values, all of which are in the sample_values field.
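Because `expr` may be either a physical column name or an arbitrary SQL expression, a consumer typically has to decide whether an alias is needed when projecting the dimension. The sketch below shows one way to do that; `select_item` is a hypothetical helper, and it does no identifier quoting or case folding.

```python
def select_item(name: str, expr: str) -> str:
    """Render a SELECT-list entry for a logical column (hypothetical helper)."""
    if expr == name:
        # expr is just the physical column, so no alias is needed.
        return name
    # expr is a derived expression, so expose it under the logical name.
    return f"{expr} AS {name}"
```

For example, `select_item("full_name", "first_name || ' ' || last_name")` produces an aliased expression, while `select_item("state", "state")` returns the bare column name.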

Measure columns contain numerical values (e.g. revenue, impressions, salary). NOTE: If modifying this protobuf, make appropriate changes in to_column_format() of snowpilot/semantic_context/utils/utils.py.

Used in: Table

string name = 1
A descriptive name for this measure.
repeated string synonyms = 2
A list of other terms/phrases used to refer to this measure.
string description = 3
A brief description about this measure, including things like what data it has.
string expr = 4
The SQL expression defining this measure. Could simply be a physical column name or an arbitrary SQL expression over one or more physical columns of the underlying physical table.
string data_type = 5
The data type of this measure. TODO(nsehrawat): Consider creating an enum instead, with all Snowflake-supported data types.
AggregationType default_aggregation = 6
If no aggregation is specified, then this is the default aggregation applied to this measure in the context of a grouping.
repeated string sample_values = 7
Sample values of this measure.
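To make the role of `default_aggregation` concrete, the sketch below wraps a measure's `expr` in the SQL aggregate named by an AggregationType value. The enum tags mirror the ones listed in this document; the helper `aggregate_sql` and the exact SQL function names it emits are assumptions for illustration only.

```python
from enum import Enum


class AggregationType(Enum):
    # Tags mirror the AggregationType enum in this schema.
    aggregation_type_unknown = 0
    sum = 1
    avg = 2
    min = 3
    max = 4
    count = 5
    count_distinct = 6
    median = 7


def aggregate_sql(expr: str, agg: AggregationType) -> str:
    """Wrap expr in the aggregate for agg (hypothetical helper)."""
    if agg is AggregationType.aggregation_type_unknown:
        # No default aggregation known: leave the expression untouched.
        return expr
    if agg is AggregationType.count_distinct:
        return f"COUNT(DISTINCT {expr})"
    # Remaining enum names map directly onto SQL function names.
    return f"{agg.name.upper()}({expr})"
```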

ForeignKey defines a foreign key that references the primary key of another table.

Used in: Table

repeated string fkey_columns = 1
Base column names of the foreign key table.
optional FullyQualifiedTable pkey_table = 2
The primary key table that this foreign key references.
repeated string pkey_columns = 3
Base column names of the primary key table.
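A foreign key with several columns can be turned into a join predicate by pairing `fkey_columns` with `pkey_columns`. The sketch below assumes the two lists are aligned by position, which the schema does not state explicitly; `fk_join_condition` and its parameters are hypothetical names.

```python
from typing import List


def fk_join_condition(
    fk_table: str,
    fkey_columns: List[str],
    pk_table: str,
    pkey_columns: List[str],
) -> str:
    """Build an equality predicate from a foreign key (hypothetical helper)."""
    if len(fkey_columns) != len(pkey_columns):
        raise ValueError("fkey_columns and pkey_columns must have equal length")
    # Assumption: the i-th foreign key column references the i-th primary key column.
    pairs = zip(fkey_columns, pkey_columns)
    return " AND ".join(f"{fk_table}.{f} = {pk_table}.{p}" for f, p in pairs)
```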

FullyQualifiedTable is used to represent three part table names - (database, schema, table).

Used in: ForeignKey, Table

string database = 1
string schema = 2
string table = 3
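A minimal Python mirror of this message shows how the three parts combine into a dotted name. The `qualified_name` method is a hypothetical addition, and it assumes the identifiers need no quoting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FullyQualifiedTable:
    # Field names mirror the FullyQualifiedTable message.
    database: str
    schema: str
    table: str

    def qualified_name(self) -> str:
        """Join the three parts with dots (no identifier quoting)."""
        return ".".join((self.database, self.schema, self.table))
```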

JoinType is the type of the join - inner, left outer, etc.

Used in: Relationship

join_type_unknown = 0
inner = 1
left_outer = 2
full_outer = 3
cross = 4
right_outer = 5

Metrics are named computations over a collection of columns. For now, a metric can only be defined over columns from a single table. In the future, we'll expand this to allow metrics that refer to columns from multiple tables.

Used in: Table

string name = 1
A descriptive name of the metric.
repeated string synonyms = 2
A list of other terms/phrases used to refer to this metric.
string description = 3
A brief description of this metric, including details of what it computes.
string expr = 4
The SQL expression to compute this metric. All columns used must be fully qualified with the logical table name. The expression must be an aggregate.
optional MetricsFilter filter = 5
The filter associated with this metric. Do not expose this for now.

MetricsFilter holds the SQL expression of the filter associated with a metric.

Used in: Metric

string expr = 1

ModuleCustomInstructions is a message that encapsulates custom instructions for each module.

Used in: SemanticModel

string sql_generation = 1
Custom instructions for SQL Generation.
string question_categorization = 2
Custom instructions for Question Categorization.

NamedFilter represents a named SQL expression that's used for filtering. TODO: add validation. We should only support WHERE-clause-style filters (no aggregations) and reject HAVING clauses.

Used in: Table

string name = 1
A descriptive name for this filter.
repeated string synonyms = 2
A list of other terms/phrases used to refer to this filter.
string description = 3
A brief description of this filter, including details of what it is typically used for.
string expr = 4
The SQL expression of this filter.

PrimaryKey defines a primary key of a table. In the general case, a primary key is a collection of columns of the table. For discussion: primary and foreign keys are potentially duplicative of the join path in a semantic model. However, a primary key implies uniqueness, which can be informative for getting the right aggregation level. For that reason, we are exposing only the PrimaryKey currently. Join paths seem more extensible than foreign keys for supporting joins. Further experimentation is needed to see if JoinPath and ForeignKey can yield similar results.

Used in: Table

repeated string columns = 1
Base column names that constitute the primary key.

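RelationKey is a pair of columns, one from each side of a join, that represents the join relationship.

Used in: Relationship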

string left_column = 1
Only equi-join relationships are supported for now.
string right_column = 2

Relationship represents a join between two tables.

Used in: SemanticModel

string name = 1
A unique name of the join.
string left_table = 2
The left hand side table of the join.
string right_table = 3
The right hand side table of the join.
string expr = 4
The expression used to join left and right tables. Only used internally.
repeated RelationKey relationship_columns = 7
Keys directly represent the join relationship.
JoinType join_type = 5
Type of the join.
RelationshipType relationship_type = 6
Type of the relationship.
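Since only equi-joins are supported, a Relationship's `relationship_columns` and `join_type` are enough to reconstruct a JOIN clause. The sketch below is one possible rendering under that reading; the dataclasses mirror the message fields, while `join_clause`, the string-valued `join_type`, and the keyword mapping are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RelationKey:
    left_column: str
    right_column: str


@dataclass
class Relationship:
    name: str
    left_table: str
    right_table: str
    join_type: str = "inner"  # one of the JoinType names
    relationship_columns: List[RelationKey] = field(default_factory=list)


# Hypothetical mapping from JoinType names to SQL keywords.
_JOIN_SQL = {
    "inner": "INNER JOIN",
    "left_outer": "LEFT OUTER JOIN",
    "right_outer": "RIGHT OUTER JOIN",
    "full_outer": "FULL OUTER JOIN",
}


def join_clause(rel: Relationship) -> str:
    """Render the JOIN clause that attaches right_table to left_table."""
    if rel.join_type == "cross":
        # A cross join has no ON condition.
        return f"CROSS JOIN {rel.right_table}"
    on = " AND ".join(
        f"{rel.left_table}.{k.left_column} = {rel.right_table}.{k.right_column}"
        for k in rel.relationship_columns
    )
    return f"{_JOIN_SQL[rel.join_type]} {rel.right_table} ON {on}"
```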

RelationshipType is the type of the relationship - one-to-one, many-to-one, etc.

Used in: Relationship

relationship_type_unknown = 0
one_to_one = 1
many_to_one = 2
one_to_many = 3
many_to_many = 4

RetrievalResult is a retrieved literal value together with its score.

Used in: Column

string value = 1
float score = 2

SemanticModel is the semantic context relevant to generating SQL for answering a data question.

string name = 1
A descriptive name of the project.
string description = 2
A brief description of this project, including details of what kind of analysis this project enables.
repeated Table tables = 3
List of tables in this project.
repeated Relationship relationships = 5
List of relationships in this project.
repeated VerifiedQuery verified_queries = 6
List of verified queries for this semantic model.
string custom_instructions = 7
Custom instructions that will be applied to the final SQL generation.
optional ModuleCustomInstructions module_custom_instructions = 8
Module-specific custom instructions. The SQL generation instruction here will take precedence over the legacy custom_instructions if it exists.
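The precedence rule between the legacy `custom_instructions` field and `module_custom_instructions.sql_generation` could be resolved as in the sketch below. It reads the rule as "use the module-specific instruction whenever it is set and non-empty", which is one interpretation of the description; `effective_sql_instructions` is a hypothetical helper.

```python
from typing import Optional


def effective_sql_instructions(
    custom_instructions: str,
    module_sql_generation: Optional[str],
) -> str:
    """Resolve which instructions apply to SQL generation (hypothetical helper)."""
    if module_sql_generation:
        # The module-specific instruction wins over the legacy field.
        return module_sql_generation
    # Fall back to the legacy custom_instructions (may be empty).
    return custom_instructions
```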

Table is analogous to a database table and provides a simple view over an existing database table. A table can leave out some columns from the base table and/or introduce new derived columns.

Used in: SemanticModel

string name = 1
A descriptive name for this table.
repeated string synonyms = 2
A list of other terms/phrases used to refer to this table.
string description = 3
A brief description of this table, including details of what kinds of analysis it is typically used for.
optional FullyQualifiedTable base_table = 4
Fully qualified name of the underlying base table.
repeated Column columns = 5
We allow two formats for specifying the logical columns of a table: (1) as a single list of columns, or (2) as three separate lists of dimensions, time dimensions, and measures. For the external-facing YAML specification, we have chosen to go with (2). However, for the time being we'll support both (1) and (2) and continue using (1) as the internal representation.
repeated Dimension dimensions = 9
repeated TimeDimension time_dimensions = 10
repeated Fact measures = 11
repeated Fact facts = 12
repeated Metric metrics = 13
optional PrimaryKey primary_key = 6
Primary key of the table, if any.
repeated ForeignKey foreign_keys = 7
Foreign keys of the table, if any.
repeated NamedFilter filters = 8
Predefined filters on this table, if any.
NEXT_TAG: 14.
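Converting the external format (separate lists) into the internal single list of columns amounts to tagging each entry with its kind. The sketch below works on plain dictionaries rather than the generated protobuf classes; `flatten_to_columns` is a hypothetical name and is not the `to_column_format()` function referenced elsewhere in this schema.

```python
from typing import Any, Dict, List

Entry = Dict[str, Any]


def flatten_to_columns(
    dimensions: List[Entry],
    time_dimensions: List[Entry],
    measures: List[Entry],
) -> List[Entry]:
    """Merge the three lists into one column list, tagging each with its kind."""
    columns: List[Entry] = []
    groups = (
        ("dimension", dimensions),
        ("time_dimension", time_dimensions),
        ("measure", measures),
    )
    for kind, entries in groups:
        for entry in entries:
            # Copy the entry so the caller's input is not mutated.
            columns.append({**entry, "kind": kind})
    return columns
```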

Time dimension columns contain time values (e.g. sale_date, created_at, year). NOTE: If modifying this protobuf, make appropriate changes in to_column_format() of snowpilot/semantic_context/utils/utils.py.

Used in: Table

string name = 1
A descriptive name for this time dimension.
repeated string synonyms = 2
A list of other terms/phrases used to refer to this time dimension.
string description = 3
A brief description about this time dimension, including things like what data it has, the timezone of values, etc.
string expr = 4
The SQL expression defining this time dimension. Could simply be a physical column name or an arbitrary SQL expression over one or more columns of the physical table.
string data_type = 5
The data type of this time dimension. TODO(nsehrawat): Consider creating an enum instead, with all Snowflake-supported data types.
bool unique = 6
If true, assume that this time dimension has unique values.
repeated string sample_values = 7
Sample values of this time dimension.

VerifiedQuery represents a (question, sql) pair that has been manually verified (e.g. by an analyst) to be correct.

Used in: SemanticModel, VerifiedQueryRepository

string name = 1
A name for this verified query. Mainly used for display purposes.
string semantic_model_name = 2
The name of the semantic model on which this verified query is based.
string question = 3
The question being answered.
string sql = 4
The correct SQL query for answering the question.
int64 verified_at = 5
Timestamp at which the query was last verified, measured in seconds since epoch, in UTC.
string verified_by = 6
Name of the person who verified this query.
bool use_as_onboarding_question = 7
Whether to always include this question in the suggested questions module.
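Because `verified_at` is an integer count of seconds since the epoch in UTC, it is worth converting it with an explicit timezone instead of relying on local time. The helper names below are hypothetical; the calls are from Python's standard `datetime` module.

```python
from datetime import datetime, timezone


def verified_at_to_datetime(verified_at: int) -> datetime:
    """Convert a verified_at value into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(verified_at, tz=timezone.utc)


def now_as_verified_at() -> int:
    """Return the current time as whole seconds since the epoch (UTC)."""
    return int(datetime.now(tz=timezone.utc).timestamp())
```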

VerifiedQueryRepository is simply a collection of verified queries.

repeated VerifiedQuery verified_queries = 1

package semantic_model_generator

enum AggregationType

aggregation_type_unknown = 0

sum = 1

avg = 2

median = 7

min = 3

max = 4

count = 5

count_distinct = 6

message Column

string name = 1

repeated string synonyms = 2

string description = 3

string expr = 4

string data_type = 5

ColumnKind kind = 6

bool unique = 7

AggregationType default_aggregation = 8

repeated string sample_values = 9

bool index_and_retrieve_values = 10

repeated RetrievalResult retrieved_literals = 11

string cortex_search_service_name = 12

optional CortexSearchService cortex_search_service = 13

bool is_enum = 14

enum ColumnKind

column_kind_unknown = 0

dimension = 1

measure = 2

time_dimension = 3

metric = 4

message CortexSearchService

string database = 1

string schema = 2

string service = 3

string literal_column = 4

message Dimension

string name = 1

repeated string synonyms = 2

string description = 3

string expr = 4

string data_type = 5

bool unique = 6

repeated string sample_values = 7

optional CortexSearchService cortex_search_service = 8

string cortex_search_service_name = 9

bool is_enum = 10

message Fact

string name = 1

repeated string synonyms = 2

string description = 3

string expr = 4

string data_type = 5

AggregationType default_aggregation = 6

repeated string sample_values = 7

message ForeignKey

repeated string fkey_columns = 1

optional FullyQualifiedTable pkey_table = 2

repeated string pkey_columns = 3

message FullyQualifiedTable

string database = 1

string schema = 2

string table = 3

enum JoinType

join_type_unknown = 0

inner = 1

left_outer = 2

full_outer = 3

cross = 4

right_outer = 5

message Metric

string name = 1

repeated string synonyms = 2

string description = 3

string expr = 4

optional MetricsFilter filter = 5

message MetricsFilter

string expr = 1

message ModuleCustomInstructions

string sql_generation = 1