Project pgjobs

Full Project Specification

Copyright (c) Zlatko Michailov 2003


1. Purpose

2. Security vs Performance

3. For Application Developers

4. For Administrators

5. For pgjobs Developers




1. Purpose

The purpose of the pgjobs project is to enable scheduled execution of PostgreSQL functions. Although there might be numerous ways to implement external agents for each individual application, a solid homogeneous architecture requires that each layer of functionality be implemented in a single paradigm. Hence, if PostgreSQL is the choice for a database server, to achieve a solid architecture, all data processing should also be implemented in PostgreSQL.

This document explains:
    (1) how pgjobs works;
    (2) how pgjobs could be used by application developers;
    (3) how pgjobs could be maintained during commercial usage; and
    (4) how pgjobs could be further developed by OpenSource developers.




2. Security vs Performance

Security and performance are the two most important features for an Internet application. Security inevitably costs some performance. The tricky part is to find the balance where the system is secure enough and then to maximize performance within that constraint.

pgjobs works on a per-database basis:
    (1) All user data (schedules and scheduled functions) is stored in the database the functions belong to. Thus only users of that database may schedule jobs. Other users may not even see what is scheduled.
    (2) Some databases may be pgjobs-enabled, others may be not.
    (3) Each database has its own agent that fires the jobs on behalf of the postgres account.

A per-cluster implementation would perform better because a single database hit could obtain the information about all the jobs in the cluster. However, it would be difficult to prevent users from scheduling functions that they otherwise have no permission to execute. Moreover, in a commercial hosting environment competing customers might be able to see what their competitors have scheduled.

Within this per-database design, to maximize performance, each database agent (dbagent) is a LinuxThreads thread (pthread) that remains alive for the entire lifetime of the master agent. This avoids the cost of constant process creation and destruction. There are two options for when a dbagent should wake up:
    (1) every certain period of time, e.g. 1 minute, to check for new/modified jobs and schedules; or
    (2) only when there is a job to be fired.
The first option is simpler and more dynamic but it hits the database much more often than needed. Usually jobs are scheduled only by the database administrators and it takes only a restart of the master agent to update the dbagents. pgjobs implements option (2).
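
The wake-up rule of option (2) can be sketched in Python as follows. This is an illustration only, not the agent's actual C++ code: the function name seconds_until_next_job and the idea of handing it a precomputed list of firing times are assumptions made for the example.

```python
from datetime import datetime


def seconds_until_next_job(now, next_fire_times):
    """Return how long a dbagent would sleep before its earliest pending job.

    Returns None when nothing is pending, in which case the thread could
    simply block until the master agent stops it.
    """
    pending = [moment for moment in next_fire_times if moment > now]
    if not pending:
        return None
    return (min(pending) - now).total_seconds()


now = datetime(2003, 6, 23, 17, 0)
fire_times = [datetime(2003, 6, 24, 6, 30), datetime(2003, 6, 23, 17, 30)]
print(seconds_until_next_job(now, fire_times))  # 1800.0
```

With option (1) the sleep would instead be a fixed period, and the database would be queried on every wake-up whether or not a job was due.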




3. For Application Developers

This section explains the basics of pgjobs. It is a prerequisite for section 5. For pgjobs Developers. All pgjobs project developers should read this section first. To learn how exactly pgjobs works, proceed with sections 4. For Administrators and 5. For pgjobs Developers.

Schedules

A schedule is a pattern that matches one or more points of time in the future.

Each schedule has a fixed precision of one minute.

There are three types of schedules:
    - Date
    - Weekly
    - Interval

A date schedule matches one or more specific points of time. This type of schedule is deterministic. A date schedule may contain (in this order): year, month, day of the month, hour, and minute. If the schedule includes a year, it matches at most one point of time. Otherwise, the schedule is recurring. Once a dimension is specified, all subsequent (finer) dimensions must be specified as well, e.g. if a schedule specifies a day of the month, it must specify hours and minutes within the day. If a negative number is provided for the day of the month, it is counted back from the last day of that month, e.g. -1 is the last day of the month, -2 is the day before last, etc. The limit is -28.
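
The negative day-of-month rule can be illustrated with a short Python sketch. The function name resolve_day is hypothetical, and returning None for a positive day that does not exist in a given month (e.g. day 31 in June) is an assumption; this specification does not say how pgjobs treats that case.

```python
import calendar


def resolve_day(year, month, dd):
    """Translate a schedule's dd value into a concrete day of the month."""
    if dd == 0 or dd < -28 or dd > 31:
        raise ValueError(f"day of month out of range: {dd}")
    last_day = calendar.monthrange(year, month)[1]
    if dd < 0:
        # -1 is the last day, -2 the day before last, and so on.
        return last_day + dd + 1
    # Assumption: a day that the month does not have yields no match.
    return dd if dd <= last_day else None


print(resolve_day(2004, 2, -1))  # 29
```

The limit of -28 guarantees that a negative value resolves to a valid day in every month, since even the shortest February has 28 days.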

A weekly schedule matches multiple specific points of time. This type of schedule is deterministic. A weekly schedule may contain (in this order): day of the week, hour, and minute. Once a dimension is specified, all subsequent (finer) dimensions must be specified as well, e.g. if a schedule specifies a day of the week, it must specify hours and minutes within the day. The days of the week are specified as follows: 0 - Sunday, 1 - Monday, 2 - Tuesday, 3 - Wednesday, 4 - Thursday, 5 - Friday, 6 - Saturday, 7 - Sunday.
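
As a small illustration of the day-of-week numbering described above, the following Python sketch checks whether a calendar date falls on a given schedule day. The function name matches_weekday is an assumption for the example; it accepts both 0 and 7 for Sunday, as the list above does.

```python
from datetime import date


def matches_weekday(day, wd):
    """Return True if the date falls on schedule weekday wd (0 or 7 = Sunday)."""
    if not 0 <= wd <= 7:
        raise ValueError(f"day of week out of range: {wd}")
    # isoweekday() is 1 (Monday) .. 7 (Sunday); modulo 7 folds Sunday to 0.
    return day.isoweekday() % 7 == wd % 7


print(matches_weekday(date(2003, 6, 23), 1))  # True
```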

An interval schedule matches multiple points of time relatively offset from the moment when the pgjobs agent was started. This type of schedule is non-deterministic because the matching times depend on that start moment. An interval schedule may contain (in this order): number of days, hour, and minute. Once a dimension is specified, all subsequent (finer) dimensions must be specified as well, e.g. if a schedule specifies number of days, it must specify hours and minutes within the day. Any positive 32-bit integer may be used as the number of days.
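
A minimal Python sketch of how interval firing times relate to the agent start time is shown here. The generator name interval_fire_times is hypothetical, and it assumes the first firing happens one full interval after the agent starts.

```python
from datetime import datetime, timedelta
from itertools import islice


def interval_fire_times(agent_start, days=0, hours=0, minutes=0):
    """Yield successive firing times, each one interval after the previous."""
    step = timedelta(days=days, hours=hours, minutes=minutes)
    if step <= timedelta(0):
        raise ValueError("interval must be positive")
    moment = agent_start + step
    while True:
        yield moment
        moment += step


start = datetime(2003, 6, 23, 9, 0)
first_three = list(islice(interval_fire_times(start, minutes=5), 3))
print([moment.strftime("%H:%M") for moment in first_three])
# ['09:05', '09:10', '09:15']
```

Restarting the agent at a different moment shifts every firing time, which is why this schedule type is described as non-deterministic.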

Below is the canonical format of a schedule specification in Backus-Naur form. This canonical format is in terms of the local time zone of the machine.

schedule ::= { date-schedule | weekly-schedule | interval-schedule }
date-schedule ::= '{D|d} [[[[yyyy.]mm.]dd] hh:]nn'
weekly-schedule ::= '{W|w} [[wd] hh:]nn'
interval-schedule ::= '{I|i} [[id] hh:]nn'
dd ::= 1..31 | -1..-28
wd ::= 0..7
id ::= 1...
hh ::= 0..23
nn ::= 00..59


Examples of Schedules:

'D 2003.12.01 17:30' - date schedule, triggers only once, on December 1st, 2003 at 5:30pm
'D 12.1 17:30' - date schedule, triggers every year on December 1st at 5:30pm
'D 17:30' - date schedule, triggers every day at 5:30pm
'D -1 17:30' - date schedule, triggers on the last day of each month at 5:30pm
'W 1 7:30' - weekly schedule, triggers every Monday at 7:30am
'I 0:05' - interval schedule, triggers every 5 minutes after the pgjobs agent starts up
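
To make the canonical format concrete, here is a rough Python sketch of a parser for it. It is not the parser pgjobs uses; the function name parse_schedule and the dictionary layout are assumptions. For weekly and interval schedules the "day" key holds the wd or id value. Range checks on the day fields are left out for brevity.

```python
def parse_schedule(spec):
    """Parse a schedule string into a dict of fields (None = not given)."""
    tokens = spec.split()
    if len(tokens) not in (2, 3) or tokens[0].upper() not in ("D", "W", "I"):
        raise ValueError(f"malformed schedule: {spec!r}")
    kind = tokens[0].upper()
    fields = {"type": kind, "year": None, "month": None, "day": None,
              "hour": None, "minute": None}

    # The last token is either 'nn' or 'hh:nn'.
    clock = tokens[-1].split(":")
    if len(clock) == 2:
        fields["hour"], fields["minute"] = int(clock[0]), int(clock[1])
    elif len(clock) == 1 and len(tokens) == 2:
        fields["minute"] = int(clock[0])
    else:
        # A day part without an hour would violate the grammar.
        raise ValueError(f"malformed time in schedule: {spec!r}")
    if fields["hour"] is not None and not 0 <= fields["hour"] <= 23:
        raise ValueError(f"hour out of range: {spec!r}")
    if not 0 <= fields["minute"] <= 59:
        raise ValueError(f"minute out of range: {spec!r}")

    if len(tokens) == 3:
        if kind == "D":
            parts = [int(part) for part in tokens[1].split(".")]
            if len(parts) > 3:
                raise ValueError(f"malformed date in schedule: {spec!r}")
            # Right-align the parts: 'dd', 'mm.dd', or 'yyyy.mm.dd'.
            names = ("year", "month", "day")[-len(parts):]
            fields.update(zip(names, parts))
        else:
            fields["day"] = int(tokens[1])
    return fields


print(parse_schedule("D 12.1 17:30")["month"])  # 12
```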

Manipulating Schedules

The following API is a set of functions from the pgjobs SQL schema. Each pgjobs-enabled database has its own instance of this API:

-- Creates a new schedule with a unique name $1
-- and a definition $2.
function pgj_create_schedule( text, text )

-- Modifies the definition of an existing schedule with name $1.
function pgj_update_schedule( text, text )

-- Removes an existing schedule with name $1.
-- If existing jobs use the target schedule and the $2 flag is FALSE,
-- the operation fails. If the $2 flag is TRUE, the target schedule is
-- detached from any existing job, but the job definitions remain.
function pgj_drop_schedule( text, boolean )

-- Removes all schedules that have no jobs attached to them.
function pgj_drop_unused_schedules()

-- Shows all existing schedules and jobs based on each of them.
view pgj_schedules

Jobs

Jobs are database functions scheduled for execution by the pgjobs agent. Any database function that accepts no parameters, or one that could be called with hardcoded parameters, could be used for a job. A job may be delayed until another schedule triggers. A job may have none, one, or multiple recurrences based on separate schedules. In addition to that, a job may be marked as startup, in which case it will be executed right after the pgjobs agent starts up regardless of any schedule attached to the job. Startup jobs may have additional recurrences.

Manipulating Jobs

The following API is a set of functions from the pgjobs SQL schema. Each pgjobs-enabled database has its own instance of this API:

-- Creates a new job with a unique name $1, which will execute function $2.
-- $2 must include '()'. If the function expects parameters,
-- hardcoded values must be provided.
-- If $3 is a schedule name, the job will not start until the specified
-- schedule triggers. If $4 is a schedule name, the job will be set for
-- recurrence according to the specified schedule.
-- Additional recurrence schedules may be provided later.
function pgj_create_job( text, text, text, text )

-- Modifies an existing job. $1 is the job name. $2 is the new job name or NULL.
-- $3 is the new function call or NULL. $4 is the name of the new start schedule,
-- NULL to keep the start schedule value as it is,
-- or an empty string '' to remove any start schedule.
function pgj_update_job( text, text, text, text )

-- Adds a recurrence schedule to an existing job. $1 is the job name.
-- $2 is a schedule name. A job may have zero, one, or more recurrence schedules.
function pgj_add_job_schedule( text, text )

-- Removes a recurrence schedule from an existing job. $1 is the job name.
-- $2 is the schedule name. A job may have zero, one, or more recurrence schedules.
function pgj_drop_job_schedule( text, text )

-- Removes an existing job. $1 is the job name.
-- Some schedules may end up unused.
function pgj_drop_job( text )

-- Creates a new startup job with a unique name $1, which will execute function $2.
-- $2 must include '()'. If the function expects parameters,
-- hardcoded values must be provided.
-- Additional recurrence schedules may be provided later.
function pgj_create_startup_job( text, text )

-- Shows all existing jobs and the schedules they are based on
view pgj_jobs

Examples

This example shows how to create a startup job with no additional recurrences. The function prepare_cache() will be executed once right after the pgjobs agent starts up. The name of the job is StartupPrepare.

SELECT pgj_create_startup_job( 'StartupPrepare', 'prepare_cache()' );

The next example shows how to create a job that is executed every day at 6:30am and 7:30pm. The name of the job is OffLine. The name of the function that will be executed is off_line().

SELECT pgj_create_schedule( 'Daily0630', 'D 6:30' );
SELECT pgj_create_schedule( 'Daily1930', 'D 19:30' );
SELECT pgj_create_job( 'OffLine', 'off_line()', null, 'Daily0630' );
SELECT pgj_add_job_schedule( 'OffLine', 'Daily1930' );

The same example could be rewritten as:

SELECT pgj_create_schedule( 'Daily0630', 'D 6:30' );
SELECT pgj_create_schedule( 'Daily1930', 'D 19:30' );
SELECT pgj_create_job( 'OffLine', 'off_line()', null, null );
SELECT pgj_add_job_schedule( 'OffLine', 'Daily0630' );
SELECT pgj_add_job_schedule( 'OffLine', 'Daily1930' );

The following example shows how to create a job that is executed every day at 8pm starting June 1st, 2003. The name of the job is EndOfDay. The name of the function that will be executed is end_of_day().

SELECT pgj_create_schedule( 'June1st2003', 'D 2003.06.01 0:00' );
SELECT pgj_create_schedule( 'Daily2000', 'D 20:00' );
SELECT pgj_create_job( 'EndOfDay', 'end_of_day()', 'June1st2003', 'Daily2000' );




4. For Administrators

This section explains how to configure and maintain pgjobs. It is mainly intended for system administrators but it is also a prerequisite for pgjobs contributors.

Enabling pgjobs in a Database

Jobs are enabled on a per-database basis. See the discussion in section 2. Security vs Performance. Enabling jobs results in:
    (1) creating the pgjobs SQL schema in the specified database; and
    (2) adding the specified database to the pgjobs agent's list.

To enable pgjobs in a given database use the following shell command. Note that the Notification and Recipient options are disregarded in this release:

pgjobs enabledb [-h HOST] [-p PORT] -d DATABASE [-u USER] [-n NOTIFICATION] [-r RECIPIENT]

If pgjobs has already been enabled for that database, the operation will fail. A disabling command must be issued and then the above command should be retried. Although a remote host is supported, it is recommended to have the agent running on the same machine as the PostgreSQL server to avoid a potential local time zone mismatch between the two machines.

To disable pgjobs in a given database use the following shell command:

pgjobs disabledb [-h HOST] [-p PORT] -d DATABASE

The above command removes the pgjobs SQL schema from that database and removes the database from the pgjobs agent's list.
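
For administrators who script these steps, the two command lines above could be assembled as in this Python sketch. The helper name pgjobs_argv is hypothetical; only the options shown in the synopses above are used, and the unused -n and -r options are omitted.

```python
def pgjobs_argv(command, database, host=None, port=None, user=None):
    """Build the argument list for 'pgjobs enabledb' or 'pgjobs disabledb'."""
    if command not in ("enabledb", "disabledb"):
        raise ValueError(f"unsupported command: {command}")
    argv = ["pgjobs", command]
    if host is not None:
        argv += ["-h", host]
    if port is not None:
        argv += ["-p", str(port)]
    argv += ["-d", database]
    if user is not None:
        # Only the enabledb synopsis lists the -u option.
        if command != "enabledb":
            raise ValueError("-u is only accepted by enabledb")
        argv += ["-u", user]
    return argv


print(" ".join(pgjobs_argv("enabledb", "MyExamples", port=5432)))
# pgjobs enabledb -p 5432 -d MyExamples
```

The resulting list can be handed to subprocess.run() on a machine where the pgjobs tool is installed.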

A complete reference of the pgjobs tool is provided later in this section.

Logging

The pgjobs log files are created in a file structure under:

/var/log/pgjobs

Each agent has its own log file for every day it runs. The master agent's folder is called master and sits directly under the pgjobs log root. Each dbagent is identified by a unique number and has its own log folder with that name. Thus the full path of the master agent's log file for June 23rd, 2003 looks like this:

/var/log/pgjobs/master/2003.06.23.log

The full log file spec of an agent identified as 1234 for the same date is:

/var/log/pgjobs/1234/2003.06.23.log
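
The naming scheme can be expressed as a small Python helper, for example when writing a script that collects logs. The function name log_file_path is an assumption for illustration.

```python
import posixpath
from datetime import date


def log_file_path(agent, day, root="/var/log/pgjobs"):
    """Return the log file path for an agent ('master' or a dbagent id)."""
    return posixpath.join(root, str(agent), day.strftime("%Y.%m.%d") + ".log")


print(log_file_path("master", date(2003, 6, 23)))
# /var/log/pgjobs/master/2003.06.23.log
```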

Each row of a log file represents one message and has the following structure:

date|time|status|message

where status is the status of the agent when the event occurred, and message is a textual description of the event.
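
A log row in this layout could be split as in the following Python sketch. The sample row is invented for the example: this specification does not state the exact date and time formats or the possible status values, so all fields are kept as plain strings.

```python
from collections import namedtuple

LogEntry = namedtuple("LogEntry", "date time status message")


def parse_log_line(line):
    """Split one log row into its four fields."""
    # Split at most three times so a '|' inside the message is preserved.
    parts = line.rstrip("\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"malformed log row: {line!r}")
    return LogEntry(*parts)


# Hypothetical row, for illustration only.
entry = parse_log_line("2003.06.23|17:30:00|running|Job OffLine fired\n")
print(entry.status)  # running
```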

Agents can be configured to log different subsets of events. Log levels from 0 to 5 are supported. Each log level includes the set of messages from the lower levels plus some more:
    Level 0 logs only errors;
    Level 1 logs errors and essential system information;
    Level 2 logs errors and more system information;
    Levels 3 and 4 are reserved for future use;
    Level 5 logs everything including debugging information.

By default the log level of each agent is set to 0. See Advanced Configuration later in this section to learn how to change it.
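
Because each level includes all lower ones, the filtering rule reduces to a single comparison, as this Python sketch shows. The constant and function names are assumptions made for the example, not identifiers from the pgjobs sources.

```python
# Minimum agent log level at which each kind of message is written.
ERROR, ESSENTIAL, SYSTEM, DEBUG = 0, 1, 2, 5


def should_log(message_level, agent_level):
    """Return True if a message of message_level is written at agent_level."""
    if not 0 <= agent_level <= 5:
        raise ValueError(f"log level out of range: {agent_level}")
    return message_level <= agent_level


print(should_log(ERROR, 0), should_log(DEBUG, 2))  # True False
```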

Advanced Configuration

pgjobs is configured through one single file:

/usr/share/pgjobs/pgjobs.conf

Each agent has one configuration row with the following structure:

agent=setting|setting|...

Blank spaces are not allowed around the '=' sign or the '|' signs. A comment starts with '#' and runs to the end of the line.

The pgjobs system has its own row of settings:

system=version

# This is the currently installed version of pgjobs
system=0.00

Following are the master agent's configuration schema and an example:

master=loglevel

#Log only errors for master
master=0

Next are a dbagent's configuration schema and an example:

id=host|port|database|loglevel|user|notification|recipient

#Host: default (localhost)
#Port: default (5432)
#Database: MyExamples
#Log Level: 5 - all debug messages
#User: default (postgres)
#Notification: none (for future use; no notification is supported)
#Recipient: none (for future use; no notification is supported)
1234=default|default|MyExamples|5|default|none|none

The recommended way to configure pgjobs is to use the pgjobs tool which is described later in this section. It is also possible for advanced administrators to edit the configuration file directly at their own risk.

The Notification and Recipient columns are for future use only. The values provided are not interpreted. They are only used as placeholders to make their future implementation and activation easier.
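
The file layout described in this subsection can be read with a few lines of Python. This is a sketch under stated assumptions, not the Config class: the function name parse_config and the dictionary layout are hypothetical, and a dbagent row with fewer columns than the schema simply leaves the trailing keys unset.

```python
DBAGENT_FIELDS = ("host", "port", "database", "loglevel",
                  "user", "notification", "recipient")


def parse_config(text):
    """Parse pgjobs.conf text into a dict keyed by agent name or id."""
    config = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        if any(bad in line for bad in (" =", "= ", " |", "| ")):
            raise ValueError(f"spaces around '=' or '|': {raw!r}")
        agent, separator, settings = line.partition("=")
        if not separator:
            raise ValueError(f"missing '=': {raw!r}")
        values = settings.split("|")
        if agent == "system":
            config[agent] = {"version": values[0]}
        elif agent == "master":
            config[agent] = {"loglevel": int(values[0])}
        else:
            if len(values) > len(DBAGENT_FIELDS):
                raise ValueError(f"too many columns: {raw!r}")
            row = dict(zip(DBAGENT_FIELDS, values))
            if "loglevel" in row:
                row["loglevel"] = int(row["loglevel"])
            config[agent] = row
    return config


SAMPLE = """\
# This is the currently installed version of pgjobs
system=0.00
#Log only errors for master
master=0
1234=default|default|MyExamples|5|default|none|none
"""
config = parse_config(SAMPLE)
print(config["1234"]["database"])  # MyExamples
```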

Tools

All the essential administrative functions can be performed through the pgjobs tool. It has the following syntax:

pgjobs command options

Following is the complete list of supported commands with brief descriptions:

Enable pgjobs on a database. It was described earlier in this section. The -n and -r options have no effect in this release:

pgjobs enabledb [-h HOST] [-p PORT] -d DATABASE [-u USER] [-n NOTIFICATION] [-r RECIPIENT]

Disable pgjobs on a database. It was described earlier in this section:

pgjobs disabledb [-h HOST] [-p PORT] -d DATABASE

View the system configuration without changing it:

pgjobs viewsystem

View the master configuration without changing it:

pgjobs viewmaster

View the configuration of a database without changing it:

pgjobs viewdb [-h HOST] [-p PORT] -d DATABASE

Modify the system configuration. The command prints the new configuration after the change has been applied:

pgjobs updatesystem -v VERSION

Modify the master configuration. The command prints the new configuration after the change has been applied:

pgjobs updatemaster -l LOGLEVEL

Modify a dbagent configuration. The command prints the new configuration after the change has been applied:

pgjobs updatedb [-h HOST] [-p PORT] -d DATABASE [-hh NEWHOST] [-pp NEWPORT] [-dd NEWDATABASE] [-l LOGLEVEL] [-u USER] [-n NOTIFICATION] [-r RECIPIENT]




5. For pgjobs Developers

This section explains the internals of pgjobs. It is assumed that you have read all the previous sections already. This section is intended for developers who want to contribute to the project and for advanced application developers who want to get the best out of pgjobs.

Database Schema

pgjobs uses three tables that are hidden from the public group. The table names are prefixed with 'pgj_'. The goal is to let the application developers distinguish these tables from their own without making these tables part of the system catalogs. To stress the fact that these tables should not be accessed directly, their names are suffixed with '_'. Here are the table definitions:

pgj_schedule_
(
    schid: int
    schname: text
    schtype: char
    schyear: int
    schmonth: int
    schday: int
    schhour: int
    schminute: int
)


pgj_job_
(
    jobid: int
    jobname: text
    jobfunction: text
    jobisstartup: boolean
    jobstartschid: int
)


pgj_job_schedule_
(
    jsjobid: int
    jschid: int
)
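
To show how the three tables relate, the sketch below mirrors them as Python dataclasses and looks up the recurrence schedules of a job. It assumes that jsjobid refers to jobid, that jschid and jobstartschid refer to schid, and that unspecified schedule dimensions are stored as NULL (None here); the column names suggest this, but the definitions above do not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Schedule:  # one row of pgj_schedule_
    schid: int
    schname: str
    schtype: str
    schyear: Optional[int] = None
    schmonth: Optional[int] = None
    schday: Optional[int] = None
    schhour: Optional[int] = None
    schminute: Optional[int] = None


@dataclass
class Job:  # one row of pgj_job_
    jobid: int
    jobname: str
    jobfunction: str
    jobisstartup: bool = False
    jobstartschid: Optional[int] = None


@dataclass
class JobSchedule:  # one row of pgj_job_schedule_
    jsjobid: int
    jschid: int


def recurrence_schedules(job, schedules, links):
    """Return the schedules attached to a job through pgj_job_schedule_."""
    wanted = {link.jschid for link in links if link.jsjobid == job.jobid}
    return [schedule for schedule in schedules if schedule.schid in wanted]


schedules = [Schedule(1, "Daily0630", "D", schhour=6, schminute=30),
             Schedule(2, "Daily1930", "D", schhour=19, schminute=30)]
off_line = Job(1, "OffLine", "off_line()")
links = [JobSchedule(1, 1), JobSchedule(1, 2)]
print([s.schname for s in recurrence_schedules(off_line, schedules, links)])
# ['Daily0630', 'Daily1930']
```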

Agent

The pgjobs agent is implemented by several C++ classes and some startup C code. The startup code includes a main() function and a TERM signal handler. The main() function starts the master agent, and the TERM signal handler stops the master agent. The master agent (MasterAgent class) reads the configuration settings through the Config class, and creates and starts a dbagent (DbAgent class) for each pgjobs-enabled database. When the master agent stops, it stops all dbagents and waits for them to exit. All dbagents and the master agent are implemented as separate threads within the same process. The MasterAgent and DbAgent classes derive from a base class, Thread, that wraps the essential Pthread function calls. The executable of the pgjobs agent is pgjobsagent. A separate class, Log, is used to log messages into text files.

The Config class is shared between the pgjobs agent and the pgjobs configuration tool.
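
The start/stop lifecycle described above can be pictured with Python's threading module. This is only an analogy to the C++ design: the class names are borrowed from the text, but the bodies are assumptions, the master is a plain object here rather than a thread, and the dbagents do nothing except wait to be stopped.

```python
import threading


class DbAgent(threading.Thread):
    """Stand-in for a dbagent: stays alive until the master requests a stop."""

    def __init__(self, agent_id, stop_requested):
        super().__init__(name=f"dbagent-{agent_id}")
        self.agent_id = agent_id
        self.stop_requested = stop_requested

    def run(self):
        # A real dbagent would sleep until its next job is due.
        self.stop_requested.wait()


class MasterAgent:
    """Creates one DbAgent per enabled database and stops them together."""

    def __init__(self, agent_ids):
        self.stop_requested = threading.Event()
        self.dbagents = [DbAgent(i, self.stop_requested) for i in agent_ids]

    def start(self):
        for agent in self.dbagents:
            agent.start()

    def stop(self):
        self.stop_requested.set()
        for agent in self.dbagents:
            agent.join()  # wait for every dbagent to exit


master = MasterAgent(["1234", "5678"])
master.start()
running = [agent.is_alive() for agent in master.dbagents]
master.stop()
print(running, [agent.is_alive() for agent in master.dbagents])
# [True, True] [False, False]
```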

Tools

pgjobs is managed through a single tool, also named pgjobs. It provides a command line interface to the Config class, which both reads and writes the pgjobs configuration file.

Build Dependencies

In order to build the pgjobs binaries, you need to download MSDK and install it under the directory that contains your pgjobs project, i.e. the pgjobs and the msdk directories must have the same parent.

pgjobs does not depend on the PostgreSQL main baseline. pgjobs uses the psql tool to communicate with a PostgreSQL server. This is slower than using the PostgreSQL libraries, but it was chosen to eliminate that particular dependency.
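
As an illustration of the psql-based approach, the following Python sketch builds a psql command line that would run one job function. The helper name psql_argv and the particular set of options are assumptions; this specification does not say which options the agent actually passes. The options themselves (-q, -t, -A, -h, -p, -U, -d, -c) are standard psql options.

```python
def psql_argv(database, sql, host=None, port=None, user=None):
    """Build a psql argument list that executes a single SQL command."""
    # -q: quiet, -t: rows only, -A: unaligned output (easy to parse).
    argv = ["psql", "-q", "-t", "-A"]
    if host is not None:
        argv += ["-h", host]
    if port is not None:
        argv += ["-p", str(port)]
    if user is not None:
        argv += ["-U", user]
    argv += ["-d", database, "-c", sql]
    return argv


print(psql_argv("MyExamples", "SELECT off_line();", user="postgres"))
```

Starting a psql process for every call is what makes this approach slower than a persistent library connection.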

Build

The baseline is built through a make file, makefile, located in the src/agent sub-folder. The make file prepares a directory parallel to src, distrib, that contains the distribution files needed to install pgjobs.

The make file is produced by a KDevelop template - Makefile.am. The KDevelop project file, agent.kdevprj, is located in the same directory and may be used as well.

The make file is designed to support Zlatko Michailov's free project support tools - build, backup, etc. In that case the distribution files are created in a sub-directory of distrib named as the current version of the project.

Deployment

The distribution directory (whether it is the distrib directory or a version sub-directory) contains two executable scripts - pgjobs-install and pgjobs-remove. The majority of pgjobs files are installed in:

/usr/share/pgjobs

Here is the complete list of files and their deployment locations:

agent: /usr/share/pgjobs/pgjobsagent
agent rc: /etc/rc.d/init.d/pgjobs
configuration tool: /usr/share/pgjobs/pgjobs
link to configuration tool: /usr/bin/pgjobs
configuration settings: /usr/share/pgjobs/pgjobs.conf
root log folder: /var/log/pgjobs
temp folder: /tmp/pgjobs
all other files: /usr/share/pgjobs/*

No database is pgjobs-enabled right after the installation of the pgjobs system. A database must be enabled explicitly.

pgjobs-remove disables all pgjobs-enabled databases and deletes the binary, configuration, and log files.


Copyright (c) 2003-2004, Zlatko Michailov