How should I actually implement a record history table with PostgreSQL?

Question

I want to add revisioning for records in an existing application which stores data in a PostgreSQL database. I read about strategies e.g. in this question, this question and this blog post.

I think that the approach to create a second history table which will rarely be queried will work best. However I do have some practical problems. Let's say that this is my table I want to add revision control to:

create table people(
  id serial not null primary key,
  name varchar(255) not null
);

For this very simple table my history table could look like this:

create table people_history(
  peopleId int not null references people(id) on delete cascade on update restrict,
  revision int not null,
  revisionTimestamp timestamptz not null default current_timestamp,
  name character varying(255) not null,
  primary key(peopleId, revision)
);

And this brings the first problems up:

How do I generate the revision number?

Of course I could create a sequence from which I request revision numbers which would be easy. However that would leave large gaps between revisions per person as many people share the same sequence and it would feel more natural if the revision numbers were ascending numbers without gaps per person. So I am tempted to find my revision number by select max(revision)+1 from ... where peopleId=.... However that could lead to a race condition if two threads ask for the next revision number and try to insert. That is very unlikely I have to admit (especially in my case where only few updates happen anyway) and would not cause data to corrupt as that would be a duplicate primary key and thus cause a transaction rollback, but it is not pretty either. I wonder if there is a prettier solution.

How do I insert data into the history table?

Two ways come to mind: Manually on every statement that updates the main table or using a trigger. A trigger sounds less error-prone as it is less likely that I forget about a query somewhere. However I cannot communicate to the application exactly which revision number was just created, can I? So if I want to create a couple of event tables like this:

create table peopleUserEditEvent (
  poepleId int not null,
  revision int not null,
  userId int not null references users(id) on delete set null on update restrict,
  comment text not null default '',
  primary key(paopleId, revision),
  foreign key (peopleId, revision) references people_history
);

That lists some metadata for revisions which explains why the revision was changed. In this case a user with a specific ID edited the data and might have supplied a comment.

In another case (and another event table) a cronjob might have changed something and documents the event which probably has no userId and no comment but other metadata.

To add those event data I need the revision id and if the revision id was created by a trigger it will be difficult to find out (or is there a practical way to do so?).

Revision numbers are important? I mean the gaps. You can have data, a serial or bigserial can do the work. The trigger, it is ok. But if you want more control and returning data, you can make a function to update data and your client application must use it only. So you can return the sequence value. — user_0, Feb 10 '15 at 14:41
Why do you need a revision number when you have the timestamp? You can create the number on the fly using row_number() over (order by revisionTimestamp) — Frank Heikens, Feb 10 '15 at 14:52
@user_0: The revision numbers are not really important, it would just feel more natural to have them. Looking at the DB a revision number would actually mean something to me as a human as I can see that after revision 42 comes revision 43 and revision 42 means that it is the 42th time something was changed. Of course there are other ways to calculate these numbers when needed (which I will do if no better idea comes to mind). — yankee, Feb 10 '15 at 15:51
@FrankHeikens: If two changes happen within the same second that would be a problem if I have no revision number. Additionally I want to use foreign keys linking to the revision and reference the revision e.g. from URLs. Of course the foreign keys would be possible and if I implement a consistent schema how the timestamp is serialized for external access that would be possible as well, but just a revision count seems simpler to me. — yankee, Feb 10 '15 at 15:58
@yankee, in table people_history, the column revisionTimestamp what will contains? I mean: it can contain the time when data is archived. If you need to store when data was inserted you will need a few code more. — user_0, Feb 10 '15 at 17:33
@yankee: What is your definition of "within the same second"? A timestamp has a microsecond precision, it's next to impossible to create a record with the exactly the same timestamp from 2 different concurrent processes. — Frank Heikens, Feb 10 '15 at 17:51
@FrankHeikens: OK... What if after using the software for a while the sysadmin notices that the system time is wrong by 10 minutes and adjust the system time. If anyone inserts data in those 10 minutes than my history is borked. Yes not very likely, but a simple integer which is incremented on every revision seems to be perfectly robust and universally understandable. I also do not need to deal with different serialization techniques for external references or the like. Anyhow this is really not the problem. A sequence would do. I just wonder if there is a better way... — yankee, Feb 10 '15 at 20:17
timestamptz and now() have microsecond precision, but `now()` (A.K.A `CURRENT_TIMESTAMP`) never changes during a transaction, so,if two updates happen during a transaction a single timestamp won't tell you which was first. (now() has reduced precision on windows. ) — Jasen, Feb 12 '15 at 03:57

score 0 · Answer 1 · answered Feb 13 '15 at 21:03

Well, you need one replication strategy for all tables and column you have , you can create one table to maintain all changes and insert on anytime you make a UPDATE INSERT or DELETE statement, maybe with this exemple of framwork idempiere changelog can help you

CREATE TABLE ad_changelog (
  ad_changelog_id NUMERIC(10,0) NOT NULL,
  ad_session_id NUMERIC(10,0) NOT NULL,
  ad_table_id NUMERIC(10,0) NOT NULL,
  ad_column_id NUMERIC(10,0) NOT NULL,
  isactive CHAR(1) DEFAULT 'Y'::bpchar NOT NULL,
  created TIMESTAMP WITHOUT TIME ZONE DEFAULT now() NOT NULL,
  createdby NUMERIC(10,0) NOT NULL,
  updated TIMESTAMP WITHOUT TIME ZONE DEFAULT now() NOT NULL,
  updatedby NUMERIC(10,0) NOT NULL,
  record_id NUMERIC(10,0) NOT NULL,
  oldvalue VARCHAR(2000),
  newvalue VARCHAR(2000),
  undo CHAR(1),
  redo CHAR(1),
  iscustomization CHAR(1) DEFAULT 'N'::bpchar NOT NULL,
  description VARCHAR(255),
  ad_changelog_uu VARCHAR(36) DEFAULT NULL::character varying,
  CONSTRAINT adcolumn_adchangelog FOREIGN KEY (ad_column_id)
    REFERENCES adempiere.ad_column(ad_column_id)
    MATCH PARTIAL
    ON DELETE CASCADE
    ON UPDATE NO ACTION
    DEFERRABLE
    INITIALLY DEFERRED,
  CONSTRAINT adsession_adchangelog FOREIGN KEY (ad_session_id)
    REFERENCES adempiere.ad_session(ad_session_id)
    MATCH PARTIAL
    ON DELETE NO ACTION
    ON UPDATE NO ACTION
    DEFERRABLE
    INITIALLY DEFERRED,
  CONSTRAINT adtable_adchangelog FOREIGN KEY (ad_table_id)
    REFERENCES adempiere.ad_table(ad_table_id)
    MATCH PARTIAL
    ON DELETE CASCADE
    ON UPDATE NO ACTION
    DEFERRABLE
    INITIALLY DEFERRED
) 
WITH (oids = false);

CREATE INDEX ad_changelog_speed ON adempiere.ad_changelog
  USING btree (ad_table_id, record_id);

CREATE UNIQUE INDEX ad_changelog_uu_idx ON adempiere.ad_changelog
  USING btree (ad_changelog_uu COLLATE pg_catalog."default");

How should I actually implement a record history table with PostgreSQL?

1 Answers1