Parallel queries (Robert Haas, Amit Kapila, David Rowley, many others)
With 9.6, PostgreSQL introduces initial support for parallel execution of large queries. Only strictly read-only queries where the driving table is accessed via a sequential scan can be parallelized. Hash joins and nested loops can be performed in parallel, as can aggregation (for supported aggregates). Much remains to be done, but this is already a useful set of features.
Allow GIN index builds to make effective use of maintenance_work_mem settings larger than 1GB (Robert Abraham, Teodor Sigaev)
Add pages deleted from a GIN index's pending list to the free space map immediately, to reduce bloat if the table isn't vacuumed often (Jeff Janes, Teodor Sigaev)
Add gin_clean_pending_list() function to allow manual invocation of pending-list cleanup for a GIN index, separately from vacuuming or analyzing the parent table (Jeff Janes)
Improve handling of dead index tuples in GiST indexes (Anastasia Lubennikova)
Dead index tuples are now marked as such when an index scan notices that the corresponding heap tuple is dead; and when inserting tuples, marked-dead tuples will be removed if needed to make space on the page.
Add an SP-GiST operator class for type box (Alexander Lebedev)
Avoid re-vacuuming pages containing only frozen tuples (Masahiko Sawada, Robert Haas)
Formerly, an anti-wraparound vacuum had to visit every page of a table whether or not there was anything to do there. Now, pages containing only already-frozen tuples are identified in the table's visibility map, and can be skipped by vacuum even when it's doing transaction wraparound prevention. This should greatly reduce the cost of maintaining large tables containing mostly-unchanging data.
Avoid useless heap-truncation attempts during VACUUM (Jeff Janes, Tom Lane)
This change avoids taking an exclusive table lock in some cases where no truncation is really possible. The main benefit comes from avoiding unnecessary query cancellations on standby servers.
Reduce interlocking on standby servers during replay of btree index vacuuming operations (Simon Riggs)
This change avoids substantial replication delays that sometimes occurred while replaying such operations.
Drop entries from GROUP BY if they are functionally dependent on other entries (David Rowley)
If a GROUP BY clause includes all columns of a non-deferred primary key, as well as other columns of the same relation, those other columns are redundant and can be dropped from the grouping. This saves computation in many common cases.
When appropriate, postpone evaluation of SELECT output expressions till after ORDER BY sorting (Konstantin Knizhnik)
This change ensures that volatile or expensive functions in the output list are executed in the order suggested by ORDER BY, and that they are not evaluated more times than required when there's a LIMIT. Previously, these properties held if the ordering was performed by an indexscan or pre-mergejoin sort, but not if it was performed by a top-level sort step.
Where feasible, trigger kernel writeback after a configurable number of writes, to prevent accumulation of dirty data in kernel disk buffers (Fabien Coelho, Andres Freund)
PostgreSQL writes data to the kernel's disk cache, from where it should be flushed to physical storage in due time. Many operating systems are not very smart about managing this, and will allow large amounts of dirty data to accumulate then decide to flush it all at once, leading to long delays for new I/O requests. This change attempts to alleviate this problem by explicitly requesting data flushes after a configurable interval.
On Linux, sync_file_range() is used for this purpose, and the feature is on by default because that function has few downsides. The feature is also available on other platforms that have msync() or posix_fadvise(), but those interfaces have some undesirable side-effects so the feature is not enabled by default on other platforms.
Perform checkpoint writes in sorted order (Fabien Coelho, Andres Freund)
Previously, checkpoints wrote out dirty pages in whatever order they happen to appear in within shared buffers, which usually is nearly random. That performs poorly, especially on rotating media. This change causes checkpoint-driven writes to be done in order by file and block number, and to be balanced across tablespaces.
Allow old MVCC snapshots to be invalidated after a configurable timeout (Kevin Grittner)
Normally, deleted tuples cannot be physically removed by vacuuming until the last transaction that could"see" them is gone. A transaction that stays open for a long time can thus cause considerable table bloat because space cannot be recycled. This feature allows setting a time-based limit, via the new configuration parameter old_snapshot_threshold, on how long an MVCC snapshot is guaranteed valid. After that, dead tuples are candidates for removal. A transaction using an outdated snapshot will get an error, but only if it attempts to read a page that's been modified recently enough that it might have contained such data.
Allow using an index-only scan with a partial index when the index's predicate involves column(s) not stored in the index (Tomas Vondra, Kyotaro Horiguchi)
An index-only scan is now allowed if the query mentions such columns only in WHERE clauses that match the index predicate.
Use foreign key relationships to infer selectivity for multi-column join predicates (Tomas Vondra, David Rowley)
If a table t has a multi-column foreign key restriction, say (a,b) REFERENCES r (x,y), then a WHERE condition such as t.a = r.x AND t.b = r.y cannot select more than one r row per t row. The planner formerly considered the AND'ed conditions to be independent and would often drastically misestimate the selectivity as a result. Now it compares the WHERE conditions to applicable foreign key constraints and arrives at a better estimate.
Improve aggregate-function performance by sharing calculations across multiple aggregates if they have the same arguments and transition functions (David Rowley)
Speed up visibility tests for recently-created tuples by checking our transaction snapshot, not pg_clog, to decide if the source transaction should be considered committed (Jeff Janes, Tom Lane)
Allow tuple hint bits to be set sooner than before (Andres Freund)
Improve performance of short-lived prepared transactions (Stas Kelvich, Simon Riggs, Pavan Deolasee)
Two-phase commit information is now written only to WAL during PREPARE TRANSACTION, and read back from there during COMMIT PREPARED. A separate state file is created only if the pending transaction does not get committed or aborted by the time of the next checkpoint.
Improve performance of memory context destruction (Jan Wieck)
Improve performance of ResourceOwners with many tracked objects (Aleksander Alekseev)
Improve speed of the output functions for timestamps, times, and dates (David Rowley, Andres Freund)
Avoid some unnecessary cancellations of hot-standby queries during replay of actions that take AccessExclusiveLocks (Jeff Janes)
Improve ANALYZE's estimates for columns with many nulls (Tomas Vondra, Alex Shulgin)
Previously the code tended to underestimate the number of non-null distinct values in a column with many nulls, and it also might make poor decisions about what is a most-common value.
Improve planner's estimate of the number of distinct values in a query result (Tomas Vondra)
Extend relations multiple blocks at a time, when there is contention for the relation's extension lock (Dilip Kumar)
This improves scalability by decreasing contention.
Improve sorting performance by using quicksort, not replacement selection, within steps of an external sort (Peter Geoghegan)
The new approach makes better use of CPU cache for typical cache sizes and data volumes. Where necessary, the behavior can be adjusted via the new configuration parameter replacement_sort_tuples, which see for further details.
Speed up text sorts where the same strings occur multiple times (Peter Geoghegan)
Speed up sorting of uuid, bytea, and char(n) fields by using "abbreviated" keys (Peter Geoghegan)
Support for abbreviated keys has also been added to the non-default operator classes text_pattern_ops,varchar_pattern_ops, and bpchar_pattern_ops. Processing of ordered-set aggregates can also now exploit abbreviated keys.
Speed up CREATE INDEX CONCURRENTLY by treating TIDs as 64-bit integers during the sort phase (Peter Geoghegan)
Increase the number of clog buffers for better scalability (Amit Kapila, Andres Freund)
Reduce contention for the ProcArrayLock (Amit Kapila, Robert Haas)
Improve performance by moving buffer content locks into the buffer descriptors (Andres Freund, Simon Riggs)
Replace shared-buffer header spinlocks with atomic operations to improve scalability (Alexander Korotkov, Andres Freund)
Use atomic operations, rather than a spinlock, to protect an LWLock's wait queue (Andres Freund)
Partition the freelist for shared hash tables, to reduce contention on many-CPU servers (Aleksander Alekseev)
Speed up expression evaluation in PL/pgSQL by keeping ParamListInfo entries for simple variables valid at all times (Tom Lane)
Avoid reducing the SO_SNDBUF setting below its default on recent Windows versions (Chen Huajun)
Improve the pg_stat_activity view's information about what a process is waiting for (Amit Kapila, Ildus Kurbangaliev)
Historically a process has only been shown as waiting if it was waiting for a heavyweight lock. Now waits for lightweight locks and buffer pins are also shown in pg_stat_activity. Also, the type of lock being waited for is now visible. These changes replace the waiting column with wait_event_type and wait_event.
Add pg_stat_progress_vacuum system view to provide progress reporting for VACUUM operations (Amit Langote, Robert Haas, Vinayak Pokale, Rahila Syed)
Add pg_stat_wal_receiver system view to provide information about the state of a hot-standby server's WAL receiver process (Michael Paquier)
Add pg_blocking_pids() function to reliably identify which sessions block which others (Tom Lane)
This function returns an array of the process IDs of any sessions that are blocking the session with the given process ID. Historically users have obtained such information using a self-join on the pg_locks view; but it's unreasonably tedious to do it that way with any modicum of correctness, and the addition of parallel queries has made the approach entirely impractical, since locks might be held or awaited by child worker processes rather than the session's main process.
Add pg_control_system(), pg_control_checkpoint(), pg_control_recovery(), and pg_control_init() functions to expose fields of pg_control to SQL (Joe Conway, Michael Paquier)
Add function pg_current_xlog_flush_location() to expose the current transaction log flush location (Tomas Vondra)
Add function pg_notification_queue_usage() to report how full the NOTIFY queue is (Brendan Jurd)
Limit the verbosity of memory context statistics dumps (Tom Lane)
The memory usage dump printed to the postmaster log during an out-of-memory failure now summarizes statistics when there are a large number of memory contexts, rather than possibly printing a very large report. There's also a "grand total" summary line now.
Change display format for NextXID in pg_controldata and related places (Joe Conway, Bruce Momjian)
Display epoch-and-transaction-ID values in the format number:number. The previous format number/numberwas confusingly similar to that used for LSNs.
Add --with-systemd configure switch to enable calling sd_notify() at server start and stop (Peter Eisentraut)
This allows the use of systemd service units of type notify, which greatly simplifies management ofPostgreSQL under systemd.
Allow effective_io_concurrency to be set as a tablespace parameter, to support cases where different tablespaces have different I/O characteristics (Julien Rouhaud)
Allow sessions to be terminated automatically if they sit too long in an idle-in-transaction state (Vik Fearing)
This behavior is enabled and controlled by the new configuration parameteridle_in_transaction_session_timeout. It can be useful to prevent forgotten transactions from holding onto locks or preventing vacuum cleanup for very long periods.
Add log_line_prefix option %n to print the time as a Unix epoch, with milliseconds (Tomas Vondra, Jeff Davis)
Merge the archive and hot_standby values of the wal_level configuration parameter into a single valuereplica (Peter Eisentraut)
Making a distinction between these settings no longer appears to be a good idea, and it's in the way of planned future simplification of replication setup. The old names are still accepted but are converted internally.
Allow the server's SSL key file to have group read access if owned by root (Christoph Berg)
Formerly we insisted on the key file being owned by the user running the PostgreSQL server, but that is inconvenient for some systems (such as Debian) that wish to manage certificates centrally. So also allow the case where the key file is owned by root and has group read access. It's up to the root admin to ensure that such a group doesn't include any untrusted users.
Force backends to exit if the postmaster dies (Rajeev Rastogi, Robert Haas)
Under normal circumstances the postmaster should always outlive its child processes. If for some reason it dies, cause backend sessions to exit with an error. Formerly, existing backends would continue to run until their client disconnects; but that is unsafe and inefficient, and furthermore it prevents a new postmaster from being started until the last old backend is gone. Backends will detect postmaster death when waiting for client I/O, so the exit will not be instantaneous, but in most circumstances it should happen no later than the end of the current query.
Check for serializability conflicts before reporting constraint-violation failures (Thomas Munro)
When using serializable transaction isolation, it is desirable that any error due to a concurrent transaction should manifest as a serialization failure, thereby cueing the application that a retry might succeed. Unfortunately, this doesn't reliably happen for duplicate-key failures caused by concurrent insertions. This change ensures that such an error will be reported as a serialization error, if the application explicitly checked for the presence of a conflicting key (and didn't find it) earlier in the transaction.
Ensure that invalidation messages are recorded in WAL even when issued by a transaction that has no XID assigned (Andres Freund)
This fixes some corner cases in which transactions on standby servers failed to notice changes such as new indexes.
Prevent multiple processes from trying to clean a GIN index's pending list concurrently (Teodor Sigaev, Jeff Janes)
This had been intentionally allowed, but it causes race conditions that can result in vacuum missing index entries it needs to delete.
Support synchronous replication with multiple synchronous standby servers, not just one (Masahiko Sawada, Beena Emerson, Michael Paquier, Fujii Masao, Kyotaro Horiguchi)
The number of standby servers that must acknowledge a commit before it's considered done is now configurable as part of the synchronous_standby_names parameter.
Add new setting remote_apply for configuration parameter synchronous_commit (Thomas Munro)
In this mode, the master waits for the transaction to be applied on the standby server, not just written to disk. That means that you can count on a transaction started on the standby to see all commits previously acknowledged by the master.
Add a feature to the replication protocol, and a corresponding option to thepg_create_physical_replication_slot() function, to allow reserving WAL immediately when creating a replication slot (Gurjeet Singh, Michael Paquier)
This allows creation of a replication slot to guarantee that all the WAL for a base backup will be available afterwards.
Add a --slot option to pg_basebackup (Peter Eisentraut)
This lets pg_basebackup use a replication slot defined for WAL streaming. After the base backup completes, selecting the same slot for regular streaming replication allows seamless startup of the new standby server.
Extend pg_start_backup() and pg_stop_backup() to support non-exclusive backups (Magnus Hagander)
Widen tuples-processed counters to 64 bits (Andreas Scherbaum)
This change allows command tags for SELECT etc. to correctly report tuple counts larger than 4 billion. So will PL/pgSQL's GET DIAGNOSTICS ... ROW_COUNT command.
Avoid doing encoding conversions by double-conversion through MULE_INTERNAL encoding (Tom Lane)
Previously, many conversions for Cyrillic and Central European single-byte encodings were done by converting to a related MULE_INTERNAL coding scheme and then to the destination encoding. Aside from being inefficient, this meant that when the conversion encountered an untranslatable character, the error message would confusingly complain about failure to convert to or from MULE_INTERNAL, rather than the user-visible encodings.
Consider performing joins of foreign tables remotely only when the applicable user mappings match (Shigeru Hanada, Ashutosh Bapat)
Previously, the foreign join pushdown infrastructure left the question of security entirely up to individual foreign data wrappers, but it would be easy for an FDW to inadvertently open up subtle security holes that way. So, make it the core code's job to determine which user mapping OID is relevant, and don't attempt join pushdown unless it's the same for all relevant relations.
Allow COPY to copy the output of an INSERT/UPDATE/DELETE ... RETURNING query (Marko Tiikkaja)
Previously, an intermediate CTE had to be written to get this result.
Introduce ALTER object DEPENDS ON EXTENSION (Abhijit Menon-Sen)
This command allows a database object to be marked as depending on an extension, so that it will automatically go away if the extension is dropped (without needing CASCADE). But the object is not part of the extension, and thus for example will be dumped separately by pg_dump.
Make ALTER object SET SCHEMA do nothing when the object is already in the requested schema, rather than throwing an error as it historically has for most object types (Marti Raudsepp)
Add options to ALTER OPERATOR to change the selectivity functions associated with an existing operator (Yury Zhuravlev)
Add an IF NOT EXISTS option to ALTER TABLE ADD COLUMN (Fabrízio de Royes Mello)
Reduce the lock strength needed by ALTER TABLE when setting fillfactor and autovacuum-related relation options (Fabrízio de Royes Mello, Simon Riggs)
Introduce CREATE ACCESS METHOD to allow extensions to create index access methods (Alexander Korotkov, Petr Jelínek)
Add a CASCADE option to CREATE EXTENSION, to automatically create extensions it depends on (Petr Jelínek)
Remove the long-deprecated CREATEUSER/NOCREATEUSER options from CREATE ROLE and allied commands (Tom Lane)
CREATEUSER actually meant SUPERUSER, for ancient backwards-compatibility reasons. That's been a constant source of confusion for people who (reasonably) expect it to mean CREATEROLE. It's been deprecated for ten years now, so fix the problem by removing it.
Make CREATE TABLE ... LIKE include an OID column if any source table has one (Bruce Momjian)
If a CHECK constraint is declared NOT VALID in a table creation command, automatically mark it valid (Amit Langote, Amul Sul)
This matches the longstanding behavior of FOREIGN KEY constraints.
Fix DROP OPERATOR to clear pg_operator.oprcom and pg_operator.oprnegate links to the dropped operator (Roma Sokolov)
Formerly such links were left as-is, which could pose a problem in the somewhat unlikely event that the dropped operator's OID was reused for another operator.
Disallow creation of indexes on system columns, except for OID (David Rowley)
Such indexes were never considered supported, and would very possibly misbehave since the system might change the system-column fields of a tuple without updating indexes. But there was no error check to prevent them from being created.
Use the privilege system to manage access to sensitive functions (Stephen Frost)
Formerly, many security-sensitive functions contained hard-wired checks that would throw an error if they were called by a non-superuser role. This forced use of superuser roles for some relatively pedestrian tasks. The hard-wired error checks are now gone in favor of making initdb revoke the default public EXECUTEprivilege on these functions. This allows installations to choose to grant usage of such functions to trusted roles that need not have full superuser privilege.
Treat role names beginning with pg_ as reserved (Stephen Frost)
User creation of such role names is now disallowed. This prevents conflicts with built-in roles created byinitdb.
Create some built-in roles that can be used to grant access to what were previously superuser-only functions (Stephen Frost)
Allow omitting one or both boundaries in an array slice specifier, for example array_col[3:] (Yury Zhuravlev)
Omitted boundaries are taken as the upper or lower limit of the corresponding array subscript. This allows simpler specification of many common use-cases.
Be more careful about out-of-range dates and timestamps (Vitaly Burovoy)
This change prevents unexpected out-of-range errors for timestamp with time zone values very close to the implementation limits. Previously, the "same" value might be accepted or not depending on the timezonesetting, meaning that a dump and reload could fail on a value that had been accepted when presented. Now the limits are enforced according to the equivalent UTC time, not local time, so as to be independent oftimezone.
Also, PostgreSQL is now more careful to detect overflow in operations that compute new date or timestamp values, such as date+integer.
In the geometric data types, make sure that infinity and NaN component values are treated consistently during input and output (Tom Lane)
Such values will now always print the same way as they would in a simple