24

Suppose you were setting up a database to store crash test data of various vehicles. You want to store data of crash tests for speedboats, cars, and go-karts.

You could create three separate tables: SpeedboatTests, CarTests, and GokartTests. But a lot of your columns are going to be the same in each table (for example, the employee id of the person who performed the test, the direction of the collision (front, side, rear), etc.). However, plenty of columns will be different, so you don't want to just put all of the test data in a single table because you'll have quite a few columns that will always be null for speedboats, quite a few that will always be null for cars, and quite a few that will always be null for go-karts.

Let's say you also want to store some information that isn't directly related to the tests (such as the employee id of the designer of the thing being tested). These columns don't seem right to put in a "Tests" table at all, especially because they'll be repeated for all tests on the same vehicle.

Let me illustrate one possible arrangement of tables, so you can see the questions involved.

Speedboats
id | col_about_speedboats_but_not_tests1 | col_about_speedboats_but_not_tests2

Cars
id | col_about_cars_but_not_tests1 | col_about_cars_but_not_tests2

Gokarts
id | col_about_gokarts_but_not_tests1 | col_about_gokarts_but_not_tests2

Tests
id | type | id_in_type | col_about_all_tests1 | col_about_all_tests2
(id_in_type will refer to the id column of one of the next three tables,
depending on the value of type)

SpeedboatTests
id | speedboat_id | col_about_speedboat_tests1 | col_about_speedboat_tests2

CarTests
id | car_id | col_about_car_tests1 | col_about_car_tests2

GokartTests
id | gokart_id | col_about_gokart_tests1 | col_about_gokart_tests2

What is good/bad about this structure and what would be the preferred way of implementing something like this?

What if there's also some information that applies to all vehicles that you'd prefer to have in a Vehicles table? Would the CarTests table then look something like...

id | vehicle_id | ...

With a Vehicles table like this:
id | type | id_in_type
(with id_in_type pointing to the id of either a speedboat, car, or go-kart)

This is just getting to be a royal mess it seems. How SHOULD something like this be set up?

Eric Lavoie
  • 3,671
  • 2
  • 25
  • 46
Instance Hunter
  • 7,621
  • 5
  • 39
  • 52
  • 1
    possible duplicate of [How do you effectively model inheritance in a database?](http://stackoverflow.com/questions/190296/how-do-you-effectively-model-inheritance-in-a-database) – Musa Haidari Aug 25 '14 at 13:45

6 Answers6

40

The type and id_in_type design is called Polymorphic Associations. This design breaks rules of normalization in multiple ways. If nothing else, it should be a red flag that you can't declare a real foreign key constraint, because the id_in_type may reference any of several tables.

Here's a better way of defining your tables:

  • Make an abstract table Vehicles to provide an abstract reference point for all vehicle sub-types and vehicle tests.
  • Each vehicle sub-type has a primary key that does not auto-increment, but instead references Vehicles.
  • Each test sub-type has a primary key that does not auto-increment, but instead references Tests.
  • Each test sub-type also has a foreign key to the corresponding vehicle sub-type.

Here's sample DDL:

CREATE TABLE Vehicles (
 vehicle_id INT AUTO_INCREMENT PRIMARY KEY
);

CREATE TABLE Speedboats (
 vehicle_id INT PRIMARY KEY,
 col_about_speedboats_but_not_tests1 INT,
 col_about_speedboats_but_not_tests2 INT,
 FOREIGN KEY(vehicle_id) REFERENCES Vehicles(vehicle_id)
);

CREATE TABLE Cars (
 vehicle_id INT PRIMARY KEY,
 col_about_cars_but_not_tests1 INT,
 col_about_cars_but_not_tests2 INT,
 FOREIGN KEY(vehicle_id) REFERENCES Vehicles(vehicle_id)
);

CREATE TABLE Gokarts (
 vehicle_id INT PRIMARY KEY,
 col_about_gokarts_but_not_tests1 INT,
 col_about_gokarts_but_not_tests2 INT,
 FOREIGN KEY(vehicle_id) REFERENCES Vehicles(vehicle_id)
);

CREATE TABLE Tests (
 test_id INT AUTO_INCREMENT PRIMARY KEY,
 col_about_all_tests1 INT,
 col_about_all_tests2 INT
);

CREATE TABLE SpeedboatTests (
 test_id INT PRIMARY KEY,
 vehicle_id INT NOT NULL,
 col_about_speedboat_tests1 INT,
 col_about_speedboat_tests2 INT,
 FOREIGN KEY(test_id) REFERENCES Tests(test_id),
 FOREIGN KEY(vehicle_id) REFERENCES Speedboats(vehicle_id)
);

CREATE TABLE CarTests (
 test_id INT PRIMARY KEY,
 vehicle_id INT NOT NULL,
 col_about_car_tests1 INT,
 col_about_car_tests2 INT,
 FOREIGN KEY(test_id) REFERENCES Tests(test_id),
 FOREIGN KEY(vehicle_id) REFERENCES Cars(vehicle_id)
);

CREATE TABLE GokartTests (
 test_id INT PRIMARY KEY,
 vehicle_id INT NOT NULL,
 col_about_gokart_tests1 INT,
 col_about_gokart_tests2 INT,
 FOREIGN KEY(test_id) REFERENCES Tests(test_id),
 FOREIGN KEY(vehicle_id) REFERENCES Gokarts(vehicle_id)
);

You could alternatively declare Tests.vehicle_id which references Vehicles.vehicle_id and get rid of the vehicle_id foreign keys in each test sub-type table, but that would permit anomalies, such as a speedboat test that references a gokart's id.

Bill Karwin
  • 462,430
  • 80
  • 609
  • 762
  • This was extremely helpful and thorough. Thank you! – Instance Hunter Feb 16 '09 at 22:11
  • 2
    all other answers except this one and, maybe, [the one mentioning Martin Fowler](http://stackoverflow.com/a/554552/279564), should be removed or buried into oblivion... OMG.. – Rafa Jul 03 '12 at 22:10
  • This is the Class Table Inheritance approach. Other alternatives are numerated here: http://stackoverflow.com/a/3579462/200987 – oligofren Jul 28 '15 at 08:48
  • There is one issue with this: You can have a car and a gocart with the same vehicle_id. This is however an unsolvable drawback of the relational model. There is an alternative method (As used by Laravel and Ruby): reverse the relationship, use both "vehicle_id" and "vehicle_type". You gain uniqueness, but sacrifice the FK constraint because vehicle_id now points at multiple tables -- in other words, you might have subtype-less base types, but you won't have multiple subtypes per type. Which solution is right depends on which "safety feature" you prefer. – okdewit Nov 10 '19 at 13:42
  • @Fx32 Every experienced database developer I know avoids using foreign keys anyway. – Bill Karwin Nov 10 '19 at 17:32
  • @BillKarwin Could you explain what you mean? Without FKs, how would you ensure data integrity? Not every type of constraint is possible in the relational model (most notably, polymorphism), so there have to be compromises. But in my experience, anything that isn't enforced at the DB level, will eventually become tainted with invalid data -- backend code is usually too dynamic. – okdewit Nov 12 '19 at 14:25
  • @Fx32 See this [comment](https://github.com/github/gh-ost/issues/331#issuecomment-266027731) from a senior engineer at github. Many MySQL DBAs agree with his reasons. On the other hand, you're right that removing FK constraints creates a risk for your app. You have to ensure that your app code is designed to "do the right thing." But also run data cleanup jobs for the inevitable mistakes. – Bill Karwin Nov 12 '19 at 15:03
  • @BillKarwin Would you say you changed the viewpoint you expressed in SQL Antipatterns 7.4: "...use constraints like foreign keys to ensure referential integrity. Polymorphic Associations often relies too much on application code instead of meta-data". I'm dealing with a polymorphic relationship which *needs* integrity checks, and wondering if I should encapsulate the checks in code, or use SQL update/insert/delete triggers to emulate a more "complete" referential checking system. – okdewit Nov 12 '19 at 23:34
  • @okdewit, Yes, I've changed my opinion on that. It's true that without FK constraints, you have a risk of data integrity breaking down. Ideally, we wouldn't compromise that. But in real-world scenarios, businesses put a higher priority on maximizing query throughput. Many times I've given presentations on data modeling, and the only question is, "yeah, but how well does that perform?" The bottom line is that performance is more important to most people. I'm kind of sad to say that, but I have to acknowledge it. – Bill Karwin Nov 12 '19 at 23:45
  • @BillKarwin In my opinion, it would be best to use a safe, normalized and "somewhat slow" main database to write "absolute truth" to, and then denormalize that data into caches (elastic, redis, bigquery, etc) when performance is important. Compromising the single source of truth for performance is not something I'd be willing to do. – okdewit Nov 13 '19 at 00:35
  • @okdewit Good luck with that. :-) – Bill Karwin Nov 13 '19 at 01:06
  • 2
    Those who would give up essential integrity, to purchase a little temporary performance, deserve neither integrity nor performance. – Neil Apr 08 '20 at 11:02
14

For mapping inheritance hierarchies to database tables, I think Martin Fowler lays out the alternatives fairly well in his book Patterns of Enterprise Application Architecture.

http://martinfowler.com/eaaCatalog/singleTableInheritance.html

http://martinfowler.com/eaaCatalog/classTableInheritance.html

http://martinfowler.com/eaaCatalog/concreteTableInheritance.html

If the number of additional fields/columns is small for subclasses, then single table inheritance is usually the simplest to deal with.

If you're using PostgreSQL for your database and you're willing to tie yourself to a database-specific feature, it supports table inheritance directly:

http://www.postgresql.org/docs/8.3/static/ddl-inherit.html

hallidave
  • 7,969
  • 4
  • 29
  • 26
  • I'd add that with specific reference to the royal mess alluded to in the original question that the foreign key would point from the specific vehicle type to the abstract vehicle table. ie speedboat (vehicle_id FK, speedboat_specific_column1, etc...) – Robin Feb 16 '09 at 21:36
0

I would break it up into different tables, e.g. Vehicle (ID, type, etc) VehicleAttributes ()VehicleID, AttributeID, Value), CrashTestInfo(VehicleID, CrashtestID, Date etc.) CrashtestAttributes(CrashTestID, AttributeID, Value)

Or rather than attributes, separate tables for each set of similar detail that should be recorded.

cjk
  • 43,338
  • 9
  • 74
  • 109
0

If you are using SQLAlchemy, an object-relational mapper for Python, you can configure how inheritance hierarchies are mapped to database tables. Object-relational mappers are good for taming otherwise tedious SQL.

Your problem might be a good fit for vertical tables. Instead of storing everything in the schema, store the object's type and primary key in one table and key/value tuples for each object in another table. If you really were storing car tests, this setup would make it much easier to add new kinds of results.

joeforker
  • 36,731
  • 34
  • 138
  • 231
-1

Do a google search on "gen-spec relational modeling". You'll find articles on how to set up tables that store the attributes of the generalized entity (what OO programmers might call the superclass), separate tables for each of the specialized entities (subclasses), and how to use foreign keys to link it all together.

The best articles, IMO, discuss gen-spec in terms of ER modeling. If you know how to translate an ER model into a relational model, and thence to SQL tables, you'll know what to do once they show you how to model gen-spec in ER.

If you just google on "gen-spec", most of what you'll see is object oriented, not relational oriented. That stuff may be useful as well, as long as you know how to overcome the object relational impedance mismatch.

Walter Mitty
  • 16,496
  • 2
  • 24
  • 53
-3

Your design is reasonable and is following the correct normalization rules. You might be missing a Vehicle table with a Vehicle Id and Type (ie the "parent" for Speedboats, Cars, and Gokarts... where you'd keep stuff like "DesignedByUserId"). Between the Vehicle table and the Speedboats table is a one - to - one relationship, and between Vehicle and Speedboat/Cars/GoKarts there is a 1-and-only-1 relationship (ie. a vehicle can only have 1 record for speedboat, cars or go karts)... though most db's don't offer an easy enforcement mechanism for this.

One normalization rule that helps identify these sorts of things is that a field should depend only upon the primary key of the table. In a consolidated table where speedboat, cars, and gokart test results are stored together then the cars related fields depend not only on the test date but also on the vechicle id and vehicle type. The primary key for the test results table is test date + vehicle id, and vehicle type isn't what makes the test data row unique (ie. is there anyway to conduct a test on 01/01/200912:30pm on one specific vehicle that is both a speedboat and car... nope... can't be done).

I'm not explaining the normalization rule particularily well... but 3rd/4th/5th normal forms rules always confuses me when I read the formal descriptions. One of those (3rd/4th/5th) deals with fields depending upon the primary key and only the primary key. The rule make the assumption that the primary key has been correctly identified (incorrectly defininh the primary key is far too easy to do).

user53794
  • 3,710
  • 2
  • 28
  • 31
  • 1
    -1 because the Polymorphic Associations design (the `type` and `id_in_type` thing) is *not* a normalized design. – Bill Karwin Feb 16 '09 at 21:28
  • Uhmm... see http://en.wikipedia.org/wiki/Fourth_normal_form. The pizza example is fairly reasonable. – user53794 Feb 16 '09 at 22:13
  • 2
    You're saying {test_id, type} -> -> {id_in_type} passes 4NF, therefore {test_id, type} is a superkey? I'm talking about the basic definition of a relation, in which each attribute represents a value for one "thing" -- but id_in_type is three different kinds of things. – Bill Karwin Feb 17 '09 at 01:41