LATEST VERSION: 3.3.1 - CHANGELOG
Pivotal Greenplum Command Center v3.3.1

Understanding Rules

This topic provides an introduction to Workload Manager rules including how to write them and how they behave in a Greenplum Database cluster with Workload Manager.

Rules Overview

A Workload Manager rule specifies an action to execute when a specified condition is detected in the Greenplum Database cluster. Administrators write Workload Manager rules to investigate problem queries, throttle queries that consume too much CPU, or simply terminate queries that could disrupt the database system.

The rulesengine service on each Greenplum host evaluates rules against facts, called metrics, collected from the Greenplum host operating systems and database processes. At regular intervals, metrics are collected and submitted to the rulesengine service on each host. When the rules engine matches a rule, it performs its action.

A rule has an action expression and a condition expression separated by the WHEN keyword. It can be read as “do <action-exp> WHEN <condition-exp>”.

Here is a rule that terminates any session that has run for over 120 seconds:

pg_terminate_backend() when session_id:host:pid:runtime > 120

In the above rule, the action expression is pg_terminate_backend() and the condition expression is session_id:host:pid:runtime > 120.

The term session_id:host:pid:runtime is a scoped metric; runtime is the name of the metric and session_id:host:pid is the scope. This scoped metric specifies the elapsed execution time for a query executor process on a segment host. The colon-delimited sections of the scope and metric identify the source of the value:

  • session_id – ID of a Greenplum Database query
  • host – the name of a segment host
  • pid – process ID of a query executor process running on the host
  • runtime – elapsed time since the query executor process started

You create rules using the rule add command in an interactive gp-wlm session or with the --rule-add command-line option. Each rule has a unique name used for managing the rule with commands such as rule modify or rule delete.

A rule may also be labeled transient, which means the rule is active only until it is deleted or Workload Manager is restarted.

For details about the rule add command syntax and usage see Add Rules.

For reference information about Workload Manager commands that manage existing rules (modify, delete, dump, import, and restore), see Managing Rules.

The next sections provide more detailed information about the components of a rule: action expressions, condition expressions, metrics, and scopes.

Action Expression

The action to perform when a rule is triggered is specified with one of the following Workload Manager actions:

  • gpdb_record – record a custom message and details of the database query process in the gp_wlm_records database table.
  • host:throttle_gpdb_query – throttle a Greenplum Database query on a specified host.
  • host:pg_cancel_backend – cancel the current query in a session on a host by calling the PostgreSQL pg_cancel_backend() function.
  • pg_terminate_backend – terminate a session by calling the PostgreSQL pg_terminate_backend() function.

A rule’s condition expression always identifies a single query executor process on a single Greenplum segment host. When a rule’s action executes, it will have in its context the query’s session ID, a segment host name, and the process ID of a single query executor process on the host.

Each of the actions responds to the single Greenplum Database query executor process identified by the condition expression. See Rule Actions for reference information for the actions.

Action expressions are written as functions and can have zero or more arguments, specified with key=value pairs in parentheses after the action name:

<action-name>(<arg1>=<value1>,<arg2>=<value2>,...)

gpdb_record

The gpdb_record action writes the text specified in its message argument to a log file, along with details of the database query process identified in the rule’s condition expression. For example, the gpdb_record action can log a message when any query process exceeds 120 seconds:

gpdb_record(message='query runtime exceeds 120 seconds') when session_id:host:pid:runtime > 120

The gp_wlm_records external Greenplum Database table provides SQL query access to the logged records. (See Querying the gp_wlm_records Table for more information.)

The gpdb_record action has several arguments, but only the message argument is required to be specified in the rule. Here is the full list of arguments for this action:

  • message – Informative string describing the reason for recording
  • current_query - The text of the current query
  • gpdb_segment_role - Role of the database instance: GPDB_MASTER or GPDB_SEGMENT
  • host - The hostname of the segment
  • pid - The postgres process associated with the query
  • query_start - Query start time
  • session_id - Session id of the query
  • usename - Name of the user logged into this backend

With the exception of message, a value for each of these arguments is inferred from the matched query process. gpdb_record logs a record that includes the supplied message, all of these inferred values, the text of the rule, and context values from the condition expression.

host:throttle_gpdb_query

The host:throttle_gpdb_query action holds a query to a maximum share of CPU on a host, specified in the max_cpu argument as a percentage of CPU utilization.

The host: prefix on the host:throttle_gpdb_query action is a scope. The host: scope indicates that the action will be performed only on the host machines where the rule’s condition is matched. The host:throttle_gpdb_query action is currently the only scoped action. (Metrics used in the condition expression are all scoped. See Metrics and Scopes below for details.)

This host:throttle_gpdb_query rule throttles a query on a host to 30% CPU utilization:

host:throttle_gpdb_query(max_cpu=30) when session_id:host:total_cpu > 20 

The session_id:host:total_cpu scoped metric is the total percentage of CPU used by all query executor processes on a host working on the same query.

Note that this rule establishes a range between 20% and 30% CPU utilization. Throttling on a host begins when total CPU utilization for the query exceeds 20% and ends when it drops below 20%. Throttling keeps the CPU utilization from exceeding 30%. Setting max_cpu argument higher than the rule’s trigger threshold prevents rapidly alternating between throttling enabled and throttling disabled states that could occur if the threshold and maximum CPU are equal.

pg_cancel_backend

The host:pg_cancel_backend action cancels a query on a host. It executes the pg_cancel_backend() PostgreSQL function on the session matched by the condition expression.

The following rule cancels the current query in a session that exceeds 75% total CPU utilization on any segment host and has run for more than five minutes:

host:pg_cancel_backend() when session_id:host:total_cpu > 75 and session_id:host:pid:runtime > 300

When a rule cancels a query, Workload Manager logs the event in a log file on the segment host. These event records can be queried using the gp_wlm_events database view. The view depends on Greenplum external tables on each segment host and must first be set up using manage-event-tables.sh script. See Querying Workload Manager Event Data for details.

pg_terminate_backend

The pg_terminate_backend action executes the PostgreSQL pg_terminate_backend() function on the session matched by the condition expression. This is an unscoped action because a session must be terminated on all segments.

The following rule terminates a session that exceeds 75% total CPU utilization on any segment host and has run for more than five minutes:

pg_terminate_backend() when session_id:host:total_cpu > 75 and session_id:host:pid:runtime > 300

When a rule terminates a query, Workload Manager logs the event in a log file on each segment host. These event records can be queried using the gp_wlm_events database view. The view depends on Greenplum external tables on each segment host and must first be set up using manage-event-tables.sh script. See Querying Workload Manager Event Data for details.

Condition Expression

The condition expression (predicate) of a rule is a Boolean expression that identifies Greenplum Database queries you want to act upon.

Metrics can be compared to values using the following operators.

Operator Value Format Description
= A number for numeric metrics or a quoted string for strings. Matches only when the values are exactly equal.
!= A number for numeric metrics or a quoted string for strings. Matches when the values are not equal.
=~ Regular expression on the right side enclosed in slashes (/). metric =~ /sel.*by/ Performs a regular expression match between the string value and the specified regex. Posix regular expression syntax is used.
> Number Greater than
< Number Less than
>= Number Greater than or equal to
<= Number Less than or equal to

Expressions can be arbitrarily complex, joining multiple comparisons with Boolean AND and OR operators and parentheses to enforce precedence. For example:

host:pid:cpu_util > 50 or (host:pid:cpu_util > 30 and session_id:host:pid:usename = "fred")

Including Clause

The including keyword introduces a comma-separated list of metrics to add to the context when a rule triggers. Any metric referenced in the condition expression is automatically added to the context. To add context values for metrics not used in the condition expression, list the metrics after the including keyword.

Metrics in the including clause are specified without scopes. If the rules compiler cannot infer the scope from scopes already bound in the rule, the rule fails compilation with an error message.

The following rule adds the host:pid:long_name and host:pid:avg_cpu_util metrics to the context:

gpdb_record(message="CPU over 50%") when host:pid:cpu_util > 50 including long_name, avg_cpu_util

The host:pid:cpu_util metric is in the context because it is referenced in the condition clause.

When a gpdb_record action triggers, the context metrics are added to the context_args column of the gp_wlm_events table. When a host:pg_cancel_backend or pg_terminate_backend action triggers, the context metrics are added to the context column of the gp_wlm_events view.

The additional metric values can provide useful information when investigating recorded messages and termination events.

Metrics and Scopes

Metrics are data items collected by the agent, and include operating system statistics, OS process statistics, and database query data.

Workload Manager provides a rich set of metrics to use in condition expressions so that you can target queries and query processes with very specific characteristics. For example, a rule could target queries executed with a certain database role that access a certain table and use over 30% of CPU on any host.

The name of a metric is prefixed by its scope, which provides context for the metric. The host:pid scope of the host:pid:cpu_util metric, for example, means that the cpu_util metric is the percentage of CPU used by an OS process (pid) executing on a specific host (host). The session_id:host:pid scope for the session_id:host:pid:usename metric indicates that the usename metric is the database role executing a Greenplum Database segment query process. The session_id is the id of the query and host is the segment host where the query executor process, pid, is executing.

Metrics in the including list of a rule are specified without scopes. The rules compiler searches for included metrics in scopes already bound in the condition expression and fails if the scope cannot be inferred.

Rules must be written in a way to identify a single query executor process on a host. The following rule records a message when the resident memory size for any process exceeds 20%. The host:pid scope does not include a session_id, so an additional rexexp term is added to the condition expression match any query. This ensures that the host:pid:resident_size_pct metric is from a query executor process and that the action has a known query when it executes. Without the session_id:host:pid:usename comparison, this rule would fail to compile.

rule add mem_high_segment_useage_20
gpdb_record(message="MEM: high segment pct usage - 20%") when
host:pid:resident_size_pct > 20
and session_id:host:pid:usename =~/.*/

Workload Manager Metric Reference lists all of the metrics, their scopes, and their data formats.

datid Scope
The datid scope is for metrics that are values from a single database in the Greenplum Database system. The datid:datname metric, for example, can be used to restrict a rule to a specific database:

... and dataid:datname = 'my_db'

Metrics with datid scope must be combined in the condition expression with other metrics that identify a query process.

gpdb Scope
The gpdb scope is for metrics from the entire Greenplum Database system. There is currently just one such metric: gpdb:total_master_connections, which is the total number of client connects for all databases in the system. This metric could be used to prevent a rule from triggering until a specified number of connections is exceeded.

host Scope
The host scope applies to metrics that are values from a single host in the Greenplum cluster. These include the current date and time values from the host and the host’s total CPU utilization.

host:segment_id Scope
The host:segment_id Scope is used for metrics from a single Greenplum segment. It is used for metrics that report the virtual memory (vmem) usage for a segment.

host:pid Scope
The host:pid scope is for metrics referring to any operating system process on a host. These metrics include the memory, CPU, and I/O statistics available from Linux for OS processes. Metrics with host:pid scope can be used to narrow a rule to query processes using more host resources than expected.

session_id Scope
A session_id is the Greenplum cluster-wide ID for a database query. The metrics with session_id scope are CPU and disk I/O skew statistics for a single query that Workload Manager calculates from the host:pid metrics from all query executor processes on all segment hosts for the query.

session_id:host Scope
The session_id:host scope includes metrics that are aggregated memory, CPU, and I/O statistics for all processes on all hosts running a query.

session_id:host:segment_id Scope
The session_id:host:segment_id scope includes metrics that report the amount of virtual memory (vmem) consumed by a Greenplum segment for a session.

session_id:host:pid Scope
The session_id:host:pid scope is used for metrics that take values from a query executor process on a single segment host.