An asset represents a cloud resource that is being managed within a lake as a member of a zone.
Used in:
Output only. The relative resource name of the asset, of the form: `projects/{project_number}/locations/{location_id}/lakes/{lake_id}/zones/{zone_id}/assets/{asset_id}`.
Optional. User friendly display name.
Output only. System generated globally unique ID for the asset. This ID will be different if the asset is deleted and re-created with the same name.
Output only. The time when the asset was created.
Output only. The time when the asset was last updated.
Optional. User defined labels for the asset.
Optional. Description of the asset.
Output only. Current state of the asset.
Required. Specification of the resource that is referenced by this asset.
Output only. Status of the resource referenced by this asset.
Output only. Status of the security policy applied to resource referenced by this asset.
Optional. Specification of the discovery feature applied to data referenced by this asset. When this spec is left unset, the asset will use the spec set on the parent zone.
Output only. Status of the discovery feature applied to data referenced by this asset.
Settings to manage the metadata discovery and publishing for an asset.
Used in:
Optional. Whether discovery is enabled.
Optional. The list of patterns to apply for selecting data to include during discovery if only a subset of the data should be considered. For Cloud Storage bucket assets, these are interpreted as glob patterns used to match object names. For BigQuery dataset assets, these are interpreted as patterns to match table names.
Optional. The list of patterns to apply for selecting data to exclude during discovery. For Cloud Storage bucket assets, these are interpreted as glob patterns used to match object names. For BigQuery dataset assets, these are interpreted as patterns to match table names.
Optional. Configuration for CSV data.
Optional. Configuration for Json data.
Determines when discovery is triggered.
Optional. Cron schedule (https://en.wikipedia.org/wiki/Cron) for running discovery periodically. Successive discovery runs must be scheduled at least 60 minutes apart. The default is to run discovery every 60 minutes. To explicitly set a timezone in the cron tab, apply a prefix: "CRON_TZ=${IANA_TIME_ZONE}" or "TZ=${IANA_TIME_ZONE}". The ${IANA_TIME_ZONE} must be a valid string from the IANA time zone database. For example, `CRON_TZ=America/New_York 1 * * * *`, or `TZ=America/New_York 1 * * * *`.
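Taken together, the fields above form a small configuration payload. Below is a minimal sketch of a DiscoverySpec in its JSON form, assuming the usual proto3 JSON field-name mapping of the fields described here (enabled, includePatterns, csvOptions, schedule); the bucket paths and schedule are purely illustrative.

```python
import json

# Hedged sketch of an Asset/Zone DiscoverySpec payload, built as a plain
# dict in the JSON representation of the message described above.
discovery_spec = {
    "enabled": True,
    # Glob patterns for Cloud Storage objects; table-name patterns for BigQuery.
    "includePatterns": ["raw/**/*.csv"],
    "excludePatterns": ["raw/tmp/**"],
    "csvOptions": {"headerRows": 1, "delimiter": ",", "encoding": "UTF-8"},
    # Hourly discovery with an explicit IANA time zone; successive runs must be
    # at least 60 minutes apart.
    "schedule": "CRON_TZ=America/New_York 0 * * * *",
}

print(json.dumps(discovery_spec, indent=2))
```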
Describe CSV and similar semi-structured data formats.
Used in:
Optional. The number of rows to interpret as header rows that should be skipped when reading data rows.
Optional. The delimiter being used to separate values. This defaults to ','.
Optional. The character encoding of the data. The default is UTF-8.
Optional. Whether to disable the inference of data type for CSV data. If true, all columns will be registered as strings.
Describe JSON data format.
Used in:
Optional. The character encoding of the data. The default is UTF-8.
Optional. Whether to disable the inference of data type for Json data. If true, all columns will be registered as their primitive types (string, number, or boolean).
Status of discovery for an asset.
Used in:
The current status of the discovery feature.
Additional information about the current state.
Last update time of the status.
The start time of the last discovery run.
Data Stats of the asset reported by discovery.
The duration of the last discovery run.
Current state of discovery.
Used in:
State is unspecified.
Discovery for the asset is scheduled.
Discovery for the asset is running.
Discovery for the asset is currently paused (e.g. due to a lack of available resources). It will be automatically resumed.
Discovery for the asset is disabled.
The aggregated data statistics for the asset reported by discovery.
Used in:
The count of data items within the referenced resource.
The number of stored data bytes within the referenced resource.
The count of table entities within the referenced resource.
The count of fileset entities within the referenced resource.
Identifies the cloud resource that is referenced by this asset.
Used in:
Immutable. Relative name of the cloud resource that contains the data that is being managed within a lake. For example: `projects/{project_number}/buckets/{bucket_id}` or `projects/{project_number}/datasets/{dataset_id}`.
Required. Immutable. Type of resource.
Optional. Determines how read permissions are handled for each asset and their associated tables. Only available for storage bucket assets.
Access Mode determines how data stored within the resource is read. This is only applicable to storage bucket assets.
Used in:
Access mode unspecified.
Default. Data is accessed directly using storage APIs.
Data is accessed through a managed interface using BigQuery APIs.
Type of resource.
Used in:
Type not specified.
Cloud Storage bucket.
BigQuery dataset.
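For illustration, here is a minimal sketch of ResourceSpec payloads for the two documented resource types. The relative-name formats come from the description above; the project number, bucket, and dataset names are hypothetical, and the enum value strings (STORAGE_BUCKET, BIGQUERY_DATASET, MANAGED) are assumed from the documented type and access-mode descriptions.

```python
# Hedged sketch: ResourceSpec for a Cloud Storage bucket asset read through
# the managed (BigQuery API) interface, and for a BigQuery dataset asset.
bucket_asset_resource = {
    "name": "projects/1234567890/buckets/my-raw-bucket",  # hypothetical bucket
    "type": "STORAGE_BUCKET",
    "readAccessMode": "MANAGED",  # only applicable to storage bucket assets
}

bigquery_asset_resource = {
    "name": "projects/1234567890/datasets/my_dataset",  # hypothetical dataset
    "type": "BIGQUERY_DATASET",
}
```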
Status of the resource referenced by an asset.
Used in:
The current state of the managed resource.
Additional information about the current state.
Last update time of the status.
Output only. Service account associated with the BigQuery Connection.
The state of a resource.
Used in:
State unspecified.
Resource does not have any errors.
Resource has errors.
Security policy status of the asset. Data security policy, i.e., readers, writers & owners, should be specified in the lake/zone/asset IAM policy.
Used in:
The current state of the security policy applied to the attached resource.
Additional information about the current state.
Last update time of the status.
The state of the security policy.
Used in:
State unspecified.
Security policy has been successfully applied to the attached resource.
Security policy is in the process of being applied to the attached resource.
Security policy could not be applied to the attached resource due to errors.
The CloudEvent raised when an Asset is created.
The data associated with the event.
The CloudEvent raised when an Asset is deleted.
The data associated with the event.
The data within all Asset events.
Used in:
Optional. The Asset event payload. Unset for deletion events.
Aggregated status of the underlying assets of a lake or zone.
Used in:
Last update time of the status.
Number of active assets.
Number of assets that are in process of updating the security policy on attached resources.
The CloudEvent raised when an Asset is updated.
The data associated with the event.
DataAccessSpec holds the access control configuration to be enforced on data stored within resources (eg: rows, columns in BigQuery Tables). When associated with data, the data is only accessible to principals explicitly granted access through the DataAccessSpec. Principals with access to the containing resource are not implicitly granted access.
Used in:
Optional. The set of principals to be granted reader role on data stored within resources. Strings follow the IAM binding format: user:{email}, serviceAccount:{email}, group:{email}.
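A minimal sketch of a DataAccessSpec payload using the IAM-style principal strings described above; the accounts are hypothetical.

```python
# Hedged sketch of a DataAccessSpec granting the reader role on data.
data_access_spec = {
    "readers": [
        "user:analyst@example.com",
        "serviceAccount:etl-runner@my-project.iam.gserviceaccount.com",
        "group:data-readers@example.com",
    ]
}
```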
Denotes one dataAttribute in a dataTaxonomy, for example, PII. DataAttribute resources can be defined in a hierarchy. A single dataAttribute resource can contain specs of multiple types:
```
PII
  - ResourceAccessSpec:
      - readers: foo@bar.com
  - DataAccessSpec:
      - readers: bar@foo.com
```
Used in:
Output only. The relative resource name of the dataAttribute, of the form: projects/{project_number}/locations/{location_id}/dataTaxonomies/{dataTaxonomy}/attributes/{data_attribute_id}.
Output only. System generated globally unique ID for the DataAttribute. This ID will be different if the DataAttribute is deleted and re-created with the same name.
Output only. The time when the DataAttribute was created.
Output only. The time when the DataAttribute was last updated.
Optional. Description of the DataAttribute.
Optional. User friendly display name.
Optional. User-defined labels for the DataAttribute.
Optional. The ID of the parent DataAttribute resource, should belong to the same data taxonomy. Circular dependency in parent chain is not valid. Maximum depth of the hierarchy allowed is 4. [a -> b -> c -> d -> e, depth = 4]
Output only. The number of child attributes present for this attribute.
This checksum is computed by the server based on the value of other fields, and may be sent on update and delete requests to ensure the client has an up-to-date value before proceeding.
Optional. Specified when applied to a resource (eg: Cloud Storage bucket, BigQuery dataset, BigQuery table).
Optional. Specified when applied to data stored on the resource (eg: rows, columns in BigQuery Tables).
DataAttributeBinding represents binding of attributes to resources. Eg: Bind 'CustomerInfo' entity with 'PII' attribute.
Used in:
Output only. The relative resource name of the Data Attribute Binding, of the form: projects/{project_number}/locations/{location}/dataAttributeBindings/{data_attribute_binding_id}
Output only. System generated globally unique ID for the DataAttributeBinding. This ID will be different if the DataAttributeBinding is deleted and re-created with the same name.
Output only. The time when the DataAttributeBinding was created.
Output only. The time when the DataAttributeBinding was last updated.
Optional. Description of the DataAttributeBinding.
Optional. User friendly display name.
Optional. User-defined labels for the DataAttributeBinding.
This checksum is computed by the server based on the value of other fields, and may be sent on update and delete requests to ensure the client has an up-to-date value before proceeding. Etags must be used when calling the DeleteDataAttributeBinding and the UpdateDataAttributeBinding method.
The reference to the resource that is associated to attributes.
Optional. Immutable. The resource name of the resource that is associated to attributes. Presently, only entity resource is supported in the form: projects/{project}/locations/{location}/lakes/{lake}/zones/{zone}/entities/{entity_id}. Must belong to the same project and region as the attribute binding, and only one active binding can exist for a resource.
Optional. List of attributes to be associated with the resource, provided in the form: projects/{project}/locations/{location}/dataTaxonomies/{dataTaxonomy}/attributes/{data_attribute_id}
Optional. The list of paths for items within the associated resource (eg. columns within a table) along with attribute bindings.
Represents a subresource of a given resource, and associated bindings with it.
Used in:
Required. The name identifier of the path. Nested columns should be of the form: 'country.state.city'.
Optional. List of attributes to be associated with the path of the resource, provided in the form: projects/{project}/locations/{location}/dataTaxonomies/{dataTaxonomy}/attributes/{data_attribute_id}
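A minimal sketch of a DataAttributeBinding that attaches a taxonomy attribute to an entity and to one of its columns, using the resource-name forms quoted above; every name below is hypothetical.

```python
# Hedged sketch of a DataAttributeBinding with a resource-level attribute
# and a path-level (column) attribute.
attribute_binding = {
    "resource": (
        "projects/my-project/locations/us-central1/lakes/my-lake/"
        "zones/raw/entities/customers"
    ),
    "attributes": [
        "projects/my-project/locations/us-central1/dataTaxonomies/"
        "SensitiveDataTaxonomy/attributes/CustomerInfo"
    ],
    "paths": [
        {
            "name": "email",  # nested columns use dotted names, e.g. 'address.city'
            "attributes": [
                "projects/my-project/locations/us-central1/dataTaxonomies/"
                "SensitiveDataTaxonomy/attributes/PII"
            ],
        }
    ],
}
```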
The CloudEvent raised when a DataAttributeBinding is created.
The data associated with the event.
The CloudEvent raised when a DataAttributeBinding is deleted.
The data associated with the event.
The data within all DataAttributeBinding events.
Used in:
Optional. The DataAttributeBinding event payload. Unset for deletion events.
The CloudEvent raised when a DataAttributeBinding is updated.
The data associated with the event.
The CloudEvent raised when a DataAttribute is created.
The data associated with the event.
The CloudEvent raised when a DataAttribute is deleted.
The data associated with the event.
The data within all DataAttribute events.
Used in:
Optional. The DataAttribute event payload. Unset for deletion events.
The CloudEvent raised when a DataAttribute is updated.
The data associated with the event.
DataProfileResult defines the output of DataProfileScan. Each field of the table will have a field-type-specific profile result.
Used in:
The count of rows scanned.
The profile information per field.
The data scanned for this result.
Contains the name, type, mode, and field-type-specific profile information.
Used in:
List of fields with structural and profile information for each field.
A field within a table.
Used in:
The name of the field.
The field data type. Possible values include: STRING, BYTE, INT64, INT32, INT16, DOUBLE, FLOAT, DECIMAL, BOOLEAN, BINARY, TIMESTAMP, DATE, TIME, NULL, and RECORD.
The mode of the field. Possible values include: REQUIRED (a required field), NULLABLE (an optional field), and REPEATED (a repeated field).
Profile information for the corresponding field.
The profile information for each field type.
Used in:
Ratio of rows with null value against total scanned rows.
Ratio of rows with distinct values against total scanned rows. Not available for complex non-groupable field type RECORD and fields with REPEATABLE mode.
The list of top N non-null values and number of times they occur in the scanned data. N is 10 or equal to the number of distinct values in the field, whichever is smaller. Not available for complex non-groupable field type RECORD and fields with REPEATABLE mode.
Structural and profile information for the specific field type. Not available if the mode is REPEATABLE.
String type field information.
Integer type field information.
Double type field information.
The profile information for a double type field.
Used in:
Average of non-null values in the scanned data. NaN, if the field has a NaN.
Standard deviation of non-null values in the scanned data. NaN, if the field has a NaN.
Minimum of non-null values in the scanned data. NaN, if the field has a NaN.
A quartile divides the number of data points into four parts, or quarters, of more-or-less equal size. The three main quartiles are: the first quartile (Q1), which splits off the lowest 25% of data from the highest 75% and is also known as the lower or 25th empirical quartile, as 25% of the data lies below this point; the second quartile (Q2), which is the median of the data set, so 50% of the data lies below this point; and the third quartile (Q3), which splits off the highest 25% of data from the lowest 75% and is known as the upper or 75th empirical quartile, as 75% of the data lies below this point. Here, the quartiles are provided as an ordered list of quartile values for the scanned data, in the order Q1, median, Q3.
Maximum of non-null values in the scanned data. NaN, if the field has a NaN.
The profile information for an integer type field.
Used in:
Average of non-null values in the scanned data. NaN, if the field has a NaN.
Standard deviation of non-null values in the scanned data. NaN, if the field has a NaN.
Minimum of non-null values in the scanned data. NaN, if the field has a NaN.
A quartile divides the number of data points into four parts, or quarters, of more-or-less equal size. The three main quartiles are: the first quartile (Q1), which splits off the lowest 25% of data from the highest 75% and is also known as the lower or 25th empirical quartile, as 25% of the data lies below this point; the second quartile (Q2), which is the median of the data set, so 50% of the data lies below this point; and the third quartile (Q3), which splits off the highest 25% of data from the lowest 75% and is known as the upper or 75th empirical quartile, as 75% of the data lies below this point. Here, the quartiles are provided as an ordered list of quartile values for the scanned data, in the order Q1, median, Q3.
Maximum of non-null values in the scanned data. NaN, if the field has a NaN.
The profile information for a string type field.
Used in:
Minimum length of non-null values in the scanned data.
Maximum length of non-null values in the scanned data.
Average length of non-null values in the scanned data.
Top N non-null values in the scanned data.
Used in:
String value of a top N non-null value.
Count of the corresponding value in the scanned data.
DataProfileScan related setting.
Used in:
(message has no fields)
DataQualityDimensionResult provides a more detailed, per-dimension view of the results.
Used in:
Whether the dimension passed or failed.
The output of a DataQualityScan.
Used in:
Overall data quality result -- `true` if all rules passed.
A list of results at the dimension level.
A list of all the rules in a job, and their results.
The count of rows processed.
The data scanned for this result.
A rule captures data quality intent about a data source.
Used in:
ColumnMap rule which evaluates whether each column value lies between a specified range.
ColumnMap rule which evaluates whether each column value is null.
ColumnMap rule which evaluates whether each column value is contained by a specified set.
ColumnMap rule which evaluates whether each column value matches a specified regex.
ColumnAggregate rule which evaluates whether the column has duplicates.
ColumnAggregate rule which evaluates whether the column aggregate statistic lies between a specified range.
Table rule which evaluates whether each row passes the specified condition.
Table rule which evaluates whether the provided expression is true.
Optional. The unnested column which this rule is evaluated against.
Optional. Rows with `null` values will automatically fail a rule, unless `ignore_null` is `true`. In that case, such `null` rows are trivially considered passing. Only applicable to ColumnMap rules.
Required. The dimension a rule belongs to. Results are also aggregated at the dimension level. Supported dimensions are **["COMPLETENESS", "ACCURACY", "CONSISTENCY", "VALIDITY", "UNIQUENESS", "INTEGRITY"]**
Optional. The minimum ratio of **passing_rows / total_rows** required to pass this rule, with a range of [0.0, 1.0]. 0 indicates default value (i.e. 1.0).
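As a concrete illustration of the rule fields above, here is a minimal sketch of one ColumnMap rule with a range expectation; the column name is hypothetical, and field names assume the proto3 JSON mapping of the documented fields.

```python
# Hedged sketch of a DataQualityRule: a VALIDITY range check on one column.
range_rule = {
    "column": "order_total",
    "ignoreNull": True,   # null rows pass trivially for ColumnMap rules
    "dimension": "VALIDITY",
    "threshold": 0.95,    # at least 95% of rows must pass
    "rangeExpectation": {
        "minValue": "0",
        "maxValue": "10000",
        "strictMinEnabled": False,
        "strictMaxEnabled": False,
    },
}
```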
Evaluates whether each column value is null.
Used in:
(message has no fields)
Evaluates whether each column value lies between a specified range.
Used in:
Optional. The minimum column value allowed for a row to pass this validation. At least one of `min_value` and `max_value` needs to be provided.
Optional. The maximum column value allowed for a row to pass this validation. At least one of `min_value` and `max_value` needs to be provided.
Optional. Whether each value needs to be strictly greater than ('>') the minimum, or if equality is allowed. Only relevant if a `min_value` has been defined. Default = false.
Optional. Whether each value needs to be strictly lesser than ('<') the maximum, or if equality is allowed. Only relevant if a `max_value` has been defined. Default = false.
Evaluates whether each column value matches a specified regex.
Used in:
A regular expression the column value is expected to match.
Evaluates whether each row passes the specified condition. The SQL expression needs to use BigQuery standard SQL syntax and should produce a boolean value per row as the result. Example: col1 >= 0 AND col2 < 10
Used in:
The SQL expression.
Evaluates whether each column value is contained by a specified set.
Used in:
Expected values for the column value.
Evaluates whether the column aggregate statistic lies between a specified range.
Used in:
The minimum column statistic value allowed for a row to pass this validation. At least one of `min_value` and `max_value` needs to be provided.
The maximum column statistic value allowed for a row to pass this validation. At least one of `min_value` and `max_value` needs to be provided.
Whether column statistic needs to be strictly greater than ('>') the minimum, or if equality is allowed. Only relevant if a `min_value` has been defined. Default = false.
Whether column statistic needs to be strictly lesser than ('<') the maximum, or if equality is allowed. Only relevant if a `max_value` has been defined. Default = false.
Used in:
Unspecified statistic type
Evaluate the column mean
Evaluate the column min
Evaluate the column max
Evaluates whether the provided expression is true. The SQL expression needs to use BigQuery standard SQL syntax and should produce a scalar boolean result. Example: MIN(col1) >= 0
Used in:
The SQL expression.
Evaluates whether the column has duplicates.
Used in:
(message has no fields)
DataQualityRuleResult provides a more detailed, per-rule view of the results.
Used in:
The rule specified in the DataQualitySpec, as is.
Whether the rule passed or failed.
The number of rows a rule was evaluated against. This field is only valid for ColumnMap type rules. The evaluated count can be configured to either include all rows (default), with `null` rows automatically failing rule evaluation, or exclude `null` rows from the `evaluated_count` by setting `ignore_nulls = true`.
The number of rows which passed a rule evaluation. This field is only valid for ColumnMap type rules.
The number of rows with null values in the specified column.
The ratio of **passed_count / evaluated_count**. This field is only valid for ColumnMap type rules.
The query to find rows that did not pass this rule. Only applies to ColumnMap and RowCondition rules.
DataQualityScan related setting.
Used in:
The list of rules to evaluate against a data source. At least one rule is required.
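A minimal sketch of a DataQualitySpec that combines a per-row SQL condition with a table-level SQL assertion, following the rule shapes described above; the column names are hypothetical.

```python
# Hedged sketch of a DataQualitySpec with two SQL-based rules.
data_quality_spec = {
    "rules": [
        {
            "dimension": "VALIDITY",
            # BigQuery standard SQL, evaluated per row to a boolean.
            "rowConditionExpectation": {
                "sqlExpression": "discount >= 0 AND discount < 10"
            },
        },
        {
            "dimension": "CONSISTENCY",
            # BigQuery standard SQL, evaluated once to a scalar boolean.
            "tableConditionExpectation": {
                "sqlExpression": "MIN(order_total) >= 0"
            },
        },
    ]
}
```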
Represents a user-visible job which provides the insights for the related data source. For example: * Data Quality: generates queries based on the rules and runs against the data to get data quality check results. * Data Profile: analyzes the data in table(s) and generates insights about the structure, content and relationships (such as null percent, cardinality, min/max/mean, etc).
Used in:
Output only. The relative resource name of the scan, of the form: `projects/{project}/locations/{location_id}/dataScans/{datascan_id}`, where `project` refers to a *project_id* or *project_number* and `location_id` refers to a GCP region.
Output only. System generated globally unique ID for the scan. This ID will be different if the scan is deleted and re-created with the same name.
Optional. Description of the scan. Must be between 1 and 1024 characters.
Optional. User friendly display name. Must be between 1 and 256 characters.
Optional. User-defined labels for the scan.
Output only. Current state of the DataScan.
Output only. The time when the scan was created.
Output only. The time when the scan was last updated.
Required. The data source for DataScan.
Optional. DataScan execution settings. If not specified, the fields in it will use their default values.
Output only. Status of the data scan execution.
Output only. The type of DataScan.
Data scan related setting. It is required and immutable, which means that once data_quality_spec is set, it cannot be changed to data_profile_spec.
DataQualityScan related setting.
DataProfileScan related setting.
The result of the data scan.
Output only. The result of the data quality scan.
Output only. The result of the data profile scan.
DataScan execution settings.
Used in:
Optional. Spec related to how often and when a scan should be triggered. If not specified, the default is `OnDemand`, which means the scan will not run until the user calls `RunDataScan` API.
Spec related to incremental scan of the data. When an option is selected for incremental scan, it cannot be unset or changed. If not specified, a data scan will run for all data in the table.
Immutable. The unnested field (of type *Date* or *Timestamp*) that contains values which monotonically increase over time. If not specified, a data scan will run for all data in the table.
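Putting the execution settings together, here is a minimal sketch of a DataScan ExecutionSpec with a recurring schedule and an incremental field; the cron expression and column name are illustrative only.

```python
# Hedged sketch of a DataScan ExecutionSpec: scheduled runs that only scan
# rows whose incremental field value is newer than the previous run.
execution_spec = {
    "trigger": {
        "schedule": {"cron": "TZ=America/New_York 0 3 * * *"}  # daily at 03:00
    },
    # Date/Timestamp column whose values monotonically increase (hypothetical).
    "field": "event_timestamp",
}
```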
Status of the data scan execution.
Used in:
The time when the latest DataScanJob started.
The time when the latest DataScanJob ended.
The CloudEvent raised when a DataScan is created.
The data associated with the event.
The CloudEvent raised when a DataScan is deleted.
The data associated with the event.
The data within all DataScan events.
Used in:
Optional. The DataScan event payload. Unset for deletion events.
The type of DataScan.
Used in:
The DataScan type is unspecified.
Data Quality scan.
Data Profile scan.
The CloudEvent raised when a DataScan is updated.
The data associated with the event.
The data source for DataScan.
Used in:
The source is required and immutable. Once it is set, it cannot be changed to another source.
Immutable. The Dataplex entity that represents the data source (e.g. BigQuery table) for DataScan, of the form: `projects/{project_number}/locations/{location_id}/lakes/{lake_id}/zones/{zone_id}/entities/{entity_id}`.
DataTaxonomy represents a set of hierarchical DataAttribute resources grouped by a common theme. For example, a 'SensitiveDataTaxonomy' can have attributes to manage PII data. It is defined at the project level.
Used in:
Output only. The relative resource name of the DataTaxonomy, of the form: projects/{project_number}/locations/{location_id}/dataTaxonomies/{data_taxonomy_id}.
Output only. System generated globally unique ID for the dataTaxonomy. This ID will be different if the DataTaxonomy is deleted and re-created with the same name.
Output only. The time when the DataTaxonomy was created.
Output only. The time when the DataTaxonomy was last updated.
Optional. Description of the DataTaxonomy.
Optional. User friendly display name.
Optional. User-defined labels for the DataTaxonomy.
Output only. The number of attributes in the DataTaxonomy.
This checksum is computed by the server based on the value of other fields, and may be sent on update and delete requests to ensure the client has an up-to-date value before proceeding.
The CloudEvent raised when a DataTaxonomy is created.
The data associated with the event.
The CloudEvent raised when a DataTaxonomy is deleted.
The data associated with the event.
The data within all DataTaxonomy events.
Used in:
Optional. The DataTaxonomy event payload. Unset for deletion events.
The CloudEvent raised when a DataTaxonomy is updated.
The data associated with the event.
Environment represents a user-visible compute infrastructure for analytics within a lake.
Used in:
Output only. The relative resource name of the environment, of the form: projects/{project_id}/locations/{location_id}/lakes/{lake_id}/environment/{environment_id}
Optional. User friendly display name.
Output only. System generated globally unique ID for the environment. This ID will be different if the environment is deleted and re-created with the same name.
Output only. Environment creation time.
Output only. The time when the environment was last updated.
Optional. User defined labels for the environment.
Optional. Description of the environment.
Output only. Current state of the environment.
Required. Infrastructure specification for the Environment.
Optional. Configuration for sessions created for this environment.
Output only. Status of sessions created for this environment.
Output only. URI Endpoints to access sessions associated with the Environment.
URI Endpoints to access sessions associated with the Environment.
Used in:
Output only. URI to serve notebook APIs
Output only. URI to serve SQL APIs
Configuration for the underlying infrastructure used to run workloads.
Used in:
Hardware config
Optional. Compute resources needed for interactive analyze workloads.
Software config
Required. Software runtime configuration for interactive analyze workloads.
Compute resources associated with the analyze interactive workloads.
Used in:
Optional. Size in GB of the disk. Default is 100 GB.
Optional. Total number of nodes in the sessions created for this environment.
Optional. Max configurable nodes. If max_node_count > node_count, then auto-scaling is enabled.
Software Runtime Configuration to run Analyze.
Used in:
Required. Dataplex Image version.
Optional. List of Java jars to be included in the runtime environment. Valid input includes Cloud Storage URIs to Jar binaries. For example, gs://bucket-name/my/path/to/file.jar
Optional. A list of python packages to be installed. Valid formats include Cloud Storage URI to a PIP installable library. For example, gs://bucket-name/my/path/to/lib.tar.gz
Optional. Spark properties to provide configuration for use in sessions created for this environment. The properties to set on daemon config files. Property keys are specified in `prefix:property` format. The prefix must be "spark".
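For the runtime configuration above, a minimal sketch showing the `prefix:property` format for Spark properties; the image version, URIs, and property values are hypothetical.

```python
# Hedged sketch of an Environment OS-image runtime config.
os_image_runtime = {
    "imageVersion": "1.0",
    "javaLibraries": ["gs://my-bucket/libs/udfs.jar"],
    "pythonPackages": ["gs://my-bucket/libs/helpers.tar.gz"],
    "properties": {
        # Keys use the prefix:property format; the prefix must be "spark".
        "spark:spark.executor.memory": "4g",
        "spark:spark.sql.shuffle.partitions": "200",
    },
}
```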
Configuration for sessions created for this environment.
Used in:
Optional. The idle time configuration of the session. The session will be auto-terminated at the end of this period.
Optional. If True, this causes sessions to be pre-created and available for faster startup to enable interactive exploration use-cases. This defaults to False to avoid additional billed charges. These can only be set to True for the environment with name set to "default", and with default configuration.
Status of sessions created for this environment.
Used in:
Output only. Whether the environment is currently active, as determined by queries over sessions.
The CloudEvent raised when an Environment is created.
The data associated with the event.
The CloudEvent raised when an Environment is deleted.
The data associated with the event.
The data within all Environment events.
Used in:
Optional. The Environment event payload. Unset for deletion events.
The CloudEvent raised when an Environment is updated.
The data associated with the event.
A job represents an instance of a task.
Used in:
Output only. The relative resource name of the job, of the form: `projects/{project_number}/locations/{location_id}/lakes/{lake_id}/tasks/{task_id}/jobs/{job_id}`.
Output only. System generated globally unique ID for the job.
Output only. The time when the job was started.
Output only. The time when the job ended.
Output only. Execution state for the job.
Output only. The number of times the job has been retried (excluding the initial attempt).
Output only. The underlying service running a job.
Output only. The full resource name for the job run under a particular service.
Output only. Additional information about the current state.
Used in:
Service used to run the job is unspecified.
Dataproc service is used to run this job.
Used in:
The job state is unknown.
The job is running.
The job is cancelling.
The job cancellation was successful.
The job completed successfully.
The job is no longer running due to an error.
The job was cancelled outside of Dataplex.
A lake is a centralized repository for managing enterprise data across the organization, distributed across many cloud projects and stored in a variety of storage services such as Google Cloud Storage and BigQuery. The resources attached to a lake are referred to as managed resources. Data within these managed resources can be structured or unstructured. A lake provides data admins with tools to organize, secure, and manage their data at scale, and provides data scientists and data engineers an integrated experience to easily search, discover, analyze, and transform data and associated metadata.
Used in:
Output only. The relative resource name of the lake, of the form: `projects/{project_number}/locations/{location_id}/lakes/{lake_id}`.
Optional. User friendly display name.
Output only. System generated globally unique ID for the lake. This ID will be different if the lake is deleted and re-created with the same name.
Output only. The time when the lake was created.
Output only. The time when the lake was last updated.
Optional. User-defined labels for the lake.
Optional. Description of the lake.
Output only. Current state of the lake.
Output only. Service account associated with this lake. This service account must be authorized to access or operate on resources managed by the lake.
Optional. Settings to manage lake and Dataproc Metastore service instance association.
Output only. Aggregated status of the underlying assets of the lake.
Output only. Metastore status of the lake.
Settings to manage association of Dataproc Metastore with a lake.
Used in:
Optional. A relative reference to the Dataproc Metastore (https://cloud.google.com/dataproc-metastore/docs) service associated with the lake: `projects/{project_id}/locations/{location_id}/services/{service_id}`
Status of Lake and Dataproc Metastore service instance association.
Used in:
Current state of association.
Additional information about the current status.
Last update time of the metastore status of the lake.
The URI of the endpoint used to access the Metastore service.
Current state of association.
Used in:
Unspecified.
A Metastore service instance is not associated with the lake.
A Metastore service instance is attached to the lake.
Attach/detach is in progress.
Attach/detach could not be done due to errors.
The CloudEvent raised when a Lake is created.
The data associated with the event.
The CloudEvent raised when a Lake is deleted.
The data associated with the event.
The data within all Lake events.
Used in:
Optional. The Lake event payload. Unset for deletion events.
The CloudEvent raised when a Lake is updated.
The data associated with the event.
ResourceAccessSpec holds the access control configuration to be enforced on the resources, for example, Cloud Storage bucket, BigQuery dataset, BigQuery table.
Used in:
Optional. The set of principals to be granted reader role on the resource. Strings follow the IAM binding format: user:{email}, serviceAccount:{email}, group:{email}.
Optional. The set of principals to be granted writer role on the resource.
Optional. The set of principals to be granted owner role on the resource.
The data scanned during processing (e.g. in incremental DataScan)
Used in:
The range of scanned data
The range denoted by values of an incremental field
A data range denoted by a pair of start/end values of a field.
Used in:
The field that contains values which monotonically increase over time (e.g. a timestamp column).
Value that marks the start of the range.
Value that marks the end of the range.
State of a resource.
Used in:
State is not specified.
Resource is active, i.e., ready to use.
Resource is under creation.
Resource is under deletion.
Resource is active but has unresolved actions.
A task represents a user-visible job.
Used in:
Output only. The relative resource name of the task, of the form: projects/{project_number}/locations/{location_id}/lakes/{lake_id}/tasks/{task_id}.
Output only. System generated globally unique ID for the task. This ID will be different if the task is deleted and re-created with the same name.
Output only. The time when the task was created.
Output only. The time when the task was last updated.
Optional. Description of the task.
Optional. User friendly display name.
Output only. Current state of the task.
Optional. User-defined labels for the task.
Required. Spec related to how often and when a task should be triggered.
Required. Spec related to how a task is executed.
Output only. Status of the latest task executions.
Task template specific user-specified config.
Config related to running custom Spark tasks.
Config related to running scheduled Notebooks.
Execution-related settings, such as retry and service_account.
Used in:
Optional. The arguments to pass to the task. The args can use placeholders of the format ${placeholder} as part of a key/value string. These will be interpolated before passing the args to the driver. Currently supported placeholders: ${task_id}, ${job_time}. To pass positional args, set the key as TASK_ARGS. The value should be a comma-separated string of all the positional arguments. To use a delimiter other than comma, refer to https://cloud.google.com/sdk/gcloud/reference/topic/escaping. If other keys are present in the args, TASK_ARGS will be passed as the last argument.
Required. Service account to use to execute a task. If not provided, the default Compute service account for the project is used.
Optional. The project in which jobs are run. By default, the project containing the Lake is used. If a project is provided, the [ExecutionSpec.service_account][google.cloud.dataplex.v1.Task.ExecutionSpec.service_account] must belong to this project.
Optional. The maximum duration after which the job execution is expired.
Optional. The Cloud KMS key to use for encryption, of the form: `projects/{project_number}/locations/{location_id}/keyRings/{key-ring-name}/cryptoKeys/{key-name}`.
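To make the args and placeholders concrete, here is a minimal sketch of a Task ExecutionSpec; the service account, bucket, and values are hypothetical, and field names assume the proto3 JSON mapping of the fields documented above.

```python
# Hedged sketch of a Task ExecutionSpec. ${task_id} and ${job_time} are
# interpolated before the args reach the driver; TASK_ARGS carries positional
# arguments as a comma-separated string and is passed last.
task_execution_spec = {
    "args": {
        "output_prefix": "gs://my-bucket/runs/${task_id}/${job_time}",
        "TASK_ARGS": "--mode=batch,--retries=3",
    },
    "serviceAccount": "task-runner@my-project.iam.gserviceaccount.com",
    "maxJobExecutionLifetime": "3600s",  # expire the job after one hour
}
```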
Status of the task execution (e.g. Jobs).
Used in:
Output only. Last update time of the status.
Output only. Latest job execution.
Configuration for the underlying infrastructure used to run workloads.
Used in:
Hardware config.
Compute resources needed for a Task when using Dataproc Serverless.
Software config.
Container Image Runtime Configuration.
Networking config.
Vpc network.
Batch compute resources associated with the task.
Used in:
Optional. Total number of job executors. Executor Count should be between 2 and 100. [Default=2]
Optional. Max configurable executors. If max_executors_count > executors_count, then auto-scaling is enabled. Max Executor Count should be between 2 and 1000. [Default=1000]
Container Image Runtime Configuration used with Batch execution.
Used in:
Optional. Container image to use.
Optional. A list of Java JARS to add to the classpath. Valid input includes Cloud Storage URIs to Jar binaries. For example, gs://bucket-name/my/path/to/file.jar
Optional. A list of python packages to be installed. Valid formats include Cloud Storage URI to a PIP installable library. For example, gs://bucket-name/my/path/to/lib.tar.gz
Optional. Override to common configuration of open source components installed on the Dataproc cluster. The properties to set on daemon config files. Property keys are specified in `prefix:property` format, for example `core:hadoop.tmp.dir`. For more information, see [Cluster properties](https://cloud.google.com/dataproc/docs/concepts/cluster-properties).
Cloud VPC Network used to run the infrastructure.
Used in:
The Cloud VPC network identifier.
Optional. The Cloud VPC network in which the job is run. By default, the Cloud VPC network named Default within the project is used.
Optional. The Cloud VPC sub-network in which the job is run.
Optional. List of network tags to apply to the job.
Config for running scheduled notebooks.
Used in:
Required. Path to input notebook. This can be the Cloud Storage URI of the notebook file or the path to a Notebook Content. The execution args are accessible as environment variables (`TASK_key=value`).
Optional. Infrastructure specification for the execution.
Optional. Cloud Storage URIs of files to be placed in the working directory of each executor.
Optional. Cloud Storage URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
User-specified config for running a Spark task.
Used in:
Required. The specification of the main method to call to drive the job. Specify either the jar file that contains the main class or the main class name.
The Cloud Storage URI of the jar file that contains the main class. The execution args are passed in as a sequence of named process arguments (`--key=value`).
The name of the driver's main class. The jar file that contains the class must be in the default CLASSPATH or specified in `jar_file_uris`. The execution args are passed in as a sequence of named process arguments (`--key=value`).
The Cloud Storage URI of the main Python file to use as the driver. Must be a .py file. The execution args are passed in as a sequence of named process arguments (`--key=value`).
A reference to a query file. This can be the Cloud Storage URI of the query file or it can be the path to a SqlScript Content. The execution args are used to declare a set of script variables (`set key="value";`).
The query text. The execution args are used to declare a set of script variables (`set key="value";`).
Optional. Cloud Storage URIs of files to be placed in the working directory of each executor.
Optional. Cloud Storage URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
Optional. Infrastructure specification for the execution.
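A minimal sketch of a SparkTaskConfig driven by a main Python file, combining the driver, file, archive, and infrastructure fields described above; all URIs and counts are illustrative, and field names assume the proto3 JSON mapping.

```python
# Hedged sketch of a Spark task config for a PySpark driver with
# Dataproc Serverless batch compute and a network tag.
spark_task_config = {
    "pythonScriptFile": "gs://my-bucket/jobs/clean_orders.py",
    "fileUris": ["gs://my-bucket/config/settings.yaml"],
    "archiveUris": ["gs://my-bucket/deps/env.tar.gz"],
    "infrastructureSpec": {
        "batch": {"executorsCount": 2, "maxExecutorsCount": 10},
        "vpcNetwork": {"networkTags": ["dataplex-task"]},
    },
}
```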
Task scheduling and trigger settings.
Used in:
Required. Immutable. Trigger type of the user-specified Task.
Optional. The first run of the task will be after this time. If not specified, the task will run shortly after being submitted if ON_DEMAND and based on the schedule if RECURRING.
Optional. Prevent the task from executing. This does not cancel already running tasks. It is intended to temporarily disable RECURRING tasks.
Optional. Number of retry attempts before aborting. Set to zero to never attempt to retry a failed task.
Trigger only applies for RECURRING tasks.
Optional. Cron schedule (https://en.wikipedia.org/wiki/Cron) for running tasks periodically. To explicitly set a timezone to the cron tab, apply a prefix in the cron tab: "CRON_TZ=${IANA_TIME_ZONE}" or "TZ=${IANA_TIME_ZONE}". The ${IANA_TIME_ZONE} may only be a valid string from IANA time zone database. For example, `CRON_TZ=America/New_York 1 * * * *`, or `TZ=America/New_York 1 * * * *`. This field is required for RECURRING tasks.
Determines how often and when the job will run.
Used in:
Unspecified trigger type.
The task runs one-time shortly after Task Creation.
The task is scheduled to run periodically.
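Finally, a minimal sketch of a recurring TriggerSpec using the cron format quoted above; the schedule and retry count are illustrative.

```python
# Hedged sketch of a recurring Task trigger with retries.
trigger_spec = {
    "type": "RECURRING",
    "schedule": "CRON_TZ=America/New_York 30 2 * * *",  # 02:30 daily, New York time
    "maxRetries": 3,
}
```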
The CloudEvent raised when a Task is created.
The data associated with the event.
The CloudEvent raised when a Task is deleted.
The data associated with the event.
The data within all Task events.
Used in:
Optional. The Task event payload. Unset for deletion events.
The CloudEvent raised when a Task is updated.
The data associated with the event.
DataScan scheduling and trigger settings.
Used in:
DataScan scheduling and trigger settings. If not specified, the default is `onDemand`.
The scan runs once via `RunDataScan` API.
The scan is scheduled to run periodically.
The scan runs once via `RunDataScan` API.
Used in:
(message has no fields)
The scan is scheduled to run periodically.
Used in:
Required. [Cron](https://en.wikipedia.org/wiki/Cron) schedule for running scans periodically. To explicitly set a timezone in the cron tab, apply a prefix in the cron tab: **"CRON_TZ=${IANA_TIME_ZONE}"** or **"TZ=${IANA_TIME_ZONE}"**. The **${IANA_TIME_ZONE}** may only be a valid string from IANA time zone database ([wikipedia](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones#List)). For example, `CRON_TZ=America/New_York 1 * * * *`, or `TZ=America/New_York 1 * * * *`. This field is required for Schedule scans.
A zone represents a logical group of related assets within a lake. A zone can be used to map to organizational structure or represent stages of data readiness from raw to curated. It provides managing behavior that is shared or inherited by all contained assets.
Used in:
Output only. The relative resource name of the zone, of the form: `projects/{project_number}/locations/{location_id}/lakes/{lake_id}/zones/{zone_id}`.
Optional. User friendly display name.
Output only. System generated globally unique ID for the zone. This ID will be different if the zone is deleted and re-created with the same name.
Output only. The time when the zone was created.
Output only. The time when the zone was last updated.
Optional. User defined labels for the zone.
Optional. Description of the zone.
Output only. Current state of the zone.
Required. Immutable. The type of the zone.
Optional. Specification of the discovery feature applied to data in this zone.
Required. Specification of the resources that are referenced by the assets within this zone.
Output only. Aggregated status of the underlying assets of the zone.
Settings to manage the metadata discovery and publishing in a zone.
Used in:
Required. Whether discovery is enabled.
Optional. The list of patterns to apply for selecting data to include during discovery if only a subset of the data should be considered. For Cloud Storage bucket assets, these are interpreted as glob patterns used to match object names. For BigQuery dataset assets, these are interpreted as patterns to match table names.
Optional. The list of patterns to apply for selecting data to exclude during discovery. For Cloud Storage bucket assets, these are interpreted as glob patterns used to match object names. For BigQuery dataset assets, these are interpreted as patterns to match table names.
Optional. Configuration for CSV data.
Optional. Configuration for Json data.
Determines when discovery is triggered.
Optional. Cron schedule (https://en.wikipedia.org/wiki/Cron) for running discovery periodically. Successive discovery runs must be scheduled at least 60 minutes apart. The default is to run discovery every 60 minutes. To explicitly set a timezone in the cron tab, apply a prefix: "CRON_TZ=${IANA_TIME_ZONE}" or "TZ=${IANA_TIME_ZONE}". The ${IANA_TIME_ZONE} must be a valid string from the IANA time zone database. For example, `CRON_TZ=America/New_York 1 * * * *`, or `TZ=America/New_York 1 * * * *`.
Describe CSV and similar semi-structured data formats.
Used in:
Optional. The number of rows to interpret as header rows that should be skipped when reading data rows.
Optional. The delimiter being used to separate values. This defaults to ','.
Optional. The character encoding of the data. The default is UTF-8.
Optional. Whether to disable the inference of data type for CSV data. If true, all columns will be registered as strings.
Describe JSON data format.
Used in:
Optional. The character encoding of the data. The default is UTF-8.
Optional. Whether to disable the inference of data type for Json data. If true, all columns will be registered as their primitive types (string, number, or boolean).
Settings for resources attached as assets within a zone.
Used in:
Required. Immutable. The location type of the resources that are allowed to be attached to the assets within this zone.
Location type of the resources attached to a zone.
Used in:
Unspecified location type.
Resources that are associated with a single region.
Resources that are associated with a multi-region location.
Type of zone.
Used in:
Zone type not specified.
A zone that contains data that needs further processing before it is considered generally ready for consumption and analytics workloads.
A zone that contains data that is considered to be ready for broader consumption and analytics workloads. Curated structured data stored in Cloud Storage must conform to certain file formats (Parquet, Avro, and ORC) and be organized in a Hive-compatible directory layout.
The CloudEvent raised when a Zone is created.
The data associated with the event.
The CloudEvent raised when a Zone is deleted.
The data associated with the event.
The data within all Zone events.
Used in:
Optional. The Zone event payload. Unset for deletion events.
The CloudEvent raised when a Zone is updated.
The data associated with the event.