
What I'm doing

I am creating an SQL table that will provide the back-end storage mechanism for complex-typed objects. I am trying to determine how to accomplish this with the best performance. I need to be able to query on each individual simple-type value of the complex type (e.g., the string value of City in an Address complex type).

I was originally thinking that I could store the complex-type values in one record as XML, but now I am concerned about the search performance of this design. I need to be able to create variable schemas on the fly without changing anything about the database access layer.


Where I'm at now

Right now I am thinking of creating the following tables.

 TABLE:  Schemas
   COLUMN NAME       DATA TYPE
   SchemaId          uniqueidentifier
   Xsd               xml                 //contains the schema for the document of the given complex type
   DeserializeType   varchar(200)        //The Full Type name of the C# class to which the document deserializes.

 TABLE:  Documents
   COLUMN NAME       DATA TYPE      
   DocumentId        uniqueidentifier
   SchemaId          uniqueidentifier

 TABLE:  Values                            //The DocumentId+ValueXPath function as a PK
   COLUMN NAME       DATA TYPE      
   DocumentId        uniqueidentifier
   ValueXPath        varchar(250)
   Value             text
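
Expressed as T-SQL, the DDL would look something like this (a rough sketch only; note that VALUES is a reserved word, and that I use nvarchar(max) in place of the deprecated text type, since text can't be compared with =):

CREATE TABLE Schemas (
    SchemaId        uniqueidentifier NOT NULL PRIMARY KEY,
    Xsd             xml              NOT NULL,  -- schema for documents of this complex type
    DeserializeType varchar(200)     NOT NULL   -- full type name of the C# class it deserializes to
);

CREATE TABLE Documents (
    DocumentId uniqueidentifier NOT NULL PRIMARY KEY,
    SchemaId   uniqueidentifier NOT NULL REFERENCES Schemas (SchemaId)
);

-- VALUES is reserved in T-SQL, hence the brackets.
CREATE TABLE [Values] (
    DocumentId uniqueidentifier NOT NULL REFERENCES Documents (DocumentId),
    ValueXPath varchar(250)     NOT NULL,
    Value      nvarchar(max)    NULL,           -- nvarchar(max) instead of the deprecated text
    PRIMARY KEY (DocumentId, ValueXPath)        -- DocumentId + ValueXPath function as the PK
);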

From these tables, when performing queries, I would do a series of self-joins on the Values table. When I want to get the entire object by its DocumentId, I would have a generic script for creating a view that mimics a denormalized data table of the complex type.
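
As a sketch, the generated view for the Address example further down might look like this (hypothetical; the script would emit one MAX(CASE ...) column per simple value in the schema):

-- Denormalize the name/value rows back into one row per document.
CREATE VIEW AddressView AS
SELECT d.DocumentId,
       MAX(CASE WHEN v.ValueXPath = '/Address/Line1' THEN v.Value END) AS Line1,
       MAX(CASE WHEN v.ValueXPath = '/Address/Line2' THEN v.Value END) AS Line2,
       MAX(CASE WHEN v.ValueXPath = '/Address/City'  THEN v.Value END) AS City,
       MAX(CASE WHEN v.ValueXPath = '/Address/State' THEN v.Value END) AS State,
       MAX(CASE WHEN v.ValueXPath = '/Address/Zip'   THEN v.Value END) AS Zip
FROM Documents d
JOIN [Values] v ON v.DocumentId = d.DocumentId
GROUP BY d.DocumentId;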


What I want to know

I believe there are better ways to accomplish what I am trying to do, but I am a little too ignorant about the relative performance benefits of different SQL techniques. Specifically, I don't know the performance cost of:

1 - comparing the value of a text field versus that of a varchar field
2 - different kinds of joins versus nested queries
3 - retrieving a view versus an XML document from the SQL DB
4 - other factors that affect query performance which I don't know enough to name, but am experienced enough to know exist

I would appreciate any information or resources about these performance issues in SQL, as well as recommendations for how to approach this general issue in a more efficient way.


For Example

Here's an example of what I am currently planning on doing.

I have a C# class Address which looks like

public class Address {
    public string Line1 { get; set; }
    public string Line2 { get; set; }
    public string City { get; set; }
    public string State { get; set; }
    public string Zip { get; set; }
}

An instance is constructed with new Address { Line1 = "17 Mulberry Street", Line2 = "Apt C", City = "New York", State = "NY", Zip = "10001" }.

Its XML value would look like:

<Address>
   <Line1>17 Mulberry Street</Line1>
   <Line2>Apt C</Line2>
   <City>New York</City>
   <State>NY</State>
   <Zip>10001</Zip>
</Address>

Using the DB schema from above, I would have a single record in the Schemas table with an XSD definition of the Address XML schema. This instance would have a uniqueidentifier as the PK of the Documents table, and its record's SchemaId would reference the Address record in the Schemas table. There would then be five records in the Values table to represent this Address.

They would look like:

DocumentId                              ValueXPath        Value
82415E8A-8D95-4bb3-9E5C-AA4365850C70    /Address/Line1    17 Mulberry Street
82415E8A-8D95-4bb3-9E5C-AA4365850C70    /Address/Line2    Apt C
82415E8A-8D95-4bb3-9E5C-AA4365850C70    /Address/City     New York
82415E8A-8D95-4bb3-9E5C-AA4365850C70    /Address/State    NY
82415E8A-8D95-4bb3-9E5C-AA4365850C70    /Address/Zip      10001
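
To query against these rows, I would self-join the Values table once per filtered attribute. For example (a sketch against the tables above), to find the DocumentIds of addresses in New York, NY:

-- One alias of [Values] per attribute being filtered.
SELECT c.DocumentId
FROM [Values] c
JOIN [Values] s ON s.DocumentId = c.DocumentId
WHERE c.ValueXPath = '/Address/City'  AND c.Value = N'New York'
  AND s.ValueXPath = '/Address/State' AND s.Value = N'NY';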

Just Added a Bounty...

My objective is to obtain the resources I need in order to give my application a data access layer that is fully searchable and has a data-schema generated from the application layer that does not require direct database configuration (i.e. creating a new SQL table) in order to add a new aggregate root to the domain model.

I am open to the possibility of using .NET compatible technologies other than SQL, but I will require that any such suggestions be adequately substantiated in order to be considered.

smartcaveman
  • You are re-implementing the RDBMS within an RDBMS. The DB can do this already - that is what the DDL statements like create table and create schema are for.... Totally serious. Just drop the bounty and spend your time working on how your application can work out which tables/indexes to create and when. – Ben Mar 10 '11 at 19:11
  • @Ben, that makes a lot of sense. The part I don't like is that it introduces a lot more complexity into the associations, but I suppose I can auto generate these as well. – smartcaveman Mar 10 '11 at 19:28
  • The main problem is that you have not separated Data and Objects in a clean and functional manner; they are quite different, and require quite different design methodologies. Once you apply them, you will have high performance on both sides. But approaching it the way you do, with an Object or XML mindset, and the db as a storage location only, is a guaranteed performance and maintenance disaster. – PerformanceDBA Mar 13 '11 at 05:24
  • @PerformanceDBA, that's kinda why I asked the question bro – smartcaveman Mar 13 '11 at 13:55
  • @smartcaveman. In that case, ditch the Objects and XML, and post everything you know about the Data, only. Check my answer to **[this question](http://stackoverflow.com/q/4824765/484814)** and **[this question](http://stackoverflow.com/questions/4304217/database-schema-which-can-support-specialized-properties/4359193#4359193)** – PerformanceDBA Mar 14 '11 at 08:50
  • @PerformanceDBA, What is your opinion on CQRS? – smartcaveman Mar 14 '11 at 17:13
  • @smartcaveman. It is irrelevant, because it is based on the notion that you need de-normalised vs normalised databases, which is false. Justifying two databases (and an *additional* CQRS layer) is completely insane. It is simple, and much easier and faster to have a correctly normalised Relational Db. **[This answer](http://stackoverflow.com/questions/4394183/should-not-olap-database-be-denormalized-for-reading-performance/4731664#4731664)** is related. – PerformanceDBA Mar 15 '11 at 00:11
  • @PerformanceDBA, thank you for all the resources. It seems like you are saying my general approach is incorrect, and that the question itself is problematic. My objective is to create a generic solution, but it seems like you are saying it is not a good approach to do so. I understand that there will be some performance hit with any generic solution. However, I'm not convinced this problem has to outweigh the benefit of a generic persistence solution with respect to real-world business scenarios. I am going to research some of the points you made and I appreciate the quality information. – smartcaveman Mar 15 '11 at 12:57
  • @PerformanceDBA, What guidelines would you suggest for deciding whether a scenario is better suited to a document database or a relational database? – smartcaveman Mar 15 '11 at 13:02
  • @smartcaveman. (Offline for a week) My pleasure. Well, in this day and age, I would place the docs in a relational DB, as blobs. You get all the power of Relational plus docs. Most have a Full Text Search option, if you require that. That works brilliantly for IEC/ISO/ANSI Standard SQL, enterprise-class engines, but not for the non-compliant or pretend SQLs. I have not seen a justification for a document DB that is truly valid (they just do one thing well, and most things badly). – PerformanceDBA Mar 22 '11 at 10:44

5 Answers


How about looking for a solution at the architectural level? I was also breaking my head over complex graphs and performance until I discovered CQRS.

[start evangelist mode]

  • You can go document-based or relational as storage. Even both! (Event Sourcing)
  • Nice separation of concerns: Read Model vs Write Model
  • Have your cake and eat it too!

Ok, there is an initial learning / technical curve to get over ;)

[end evangelist mode]

As you stated: "I need to be able to create variable schemas on the fly without changing anything about the database access layer." The key benefit is that your read model can be very fast since it's made for reading. If you add Event Sourcing to the mix, you can drop and rebuild your Read Model to whatever schema you want... even "online".

There are some nice open-source frameworks out there, like nServiceBus, which save lots of time and technical challenges. It all depends on how far you want to take these concepts and what you're willing or able to spend time on. You can even start with just the basics if you follow Greg Young's approach. See the resources in the comments below.

Derick Schoonbee
  • I am interested in learning more about CQRS. Can you recommend any more complete resources? – smartcaveman Mar 15 '11 at 13:00
  • There is a nice 6 hr detailed video from Greg Young on cqrsinfo.com .. a "must watch" even for people new to CQRS. Then the [google group](http://groups.google.com/group/dddcqrs) is great. – Derick Schoonbee Mar 16 '11 at 07:35

Somehow, what you want sounds like a painful thing to do in SQL. Basically, you should treat the contents of a text field as opaque when querying an SQL database. Text fields were not made for efficient queries.

If you just want to store serialized objects in a text field, that is fine. But do not try to build queries that look inside the text field to find objects.

Your idea sounds like you want to perform some joins, XML parsing, and XPath application to get to a value. This doesn't strike me as the most efficient thing to do.

So, my advice, per your explanations above: simply put, don't go over the top with this thing:

  • If you just want to persist C#/.NET objects, just use the XML serialization already built into the framework and a single table, and be done with it.
  • If you, for some reason, need to store complex XML, use a dedicated XML store.
  • If you have a fixed database schema but it is too complex for efficient queries, use an in-memory document store where you keep a denormalized version of your data for faster queries (or just simplify your database schema).
  • If you don't really need a fixed schema, just use a document store, and forget about having any "schema definition" at all.

As for your solution: yes, it could work somehow, as could a plain SQL schema if you set it up right. But to apply an XPath, you'll probably have to parse the whole XML document each time you access a record, which wouldn't be very efficient to begin with.

If you want to check out document databases, there are .NET drivers for CouchDB and MongoDB. The eXist XML database offers a number of web protocols, and you can probably create a client class easily with Visual Studio's point-and-click interface. Or just google for someone who has already done it.

averell
  • I have considered something like this, but I am not very informed on the topic. Currently reading the resources you provided. Do you know anything about the relative performance in these solutions? – smartcaveman Feb 23 '11 at 14:43
  • Again, it depends on the query you want to do and the particular engine; but they will usually be reasonably performant. With a document store, you will usually denormalize your data, thus keeping the queries simple and fast. There are also some people who use document stores as a "cache" to do queries on data that originally comes from very complex SQL tables. I suggest you read up on the different solutions to see what fits you best. – averell Feb 23 '11 at 15:34

I need to be able to create variable schemas on the fly without changing anything about the database access layer.

You are re-implementing the RDBMS within an RDBMS. The DB can do this already - that is what the DDL statements like create table and create schema are for....

I suggest you look into "schemas" and SQL security. With the correct security setup, there is no reason you cannot allow your users to create their own tables to store document attributes in, or even have them generated automatically.

Edit: a slightly longer answer. If you don't have full requirements immediately, I would store the data as the XML data type and query it using XPath queries. This will be OK for occasional queries over smallish numbers of rows (fewer than a few thousand, certainly).

Also, your RDBMS may support indexes over XML, which may be another way of solving your problem: CREATE XML INDEX in SQL Server 2008, for example.
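
For example, a sketch assuming SQL Server 2005 or later and an xml column added to your Documents table (the column name Doc is hypothetical):

ALTER TABLE Documents ADD Doc xml NULL;

-- Query the xml column with XPath via exist():
SELECT DocumentId
FROM Documents
WHERE Doc.exist('/Address/City[text() = "New York"]') = 1;

-- A primary XML index materializes the shredded nodes,
-- which can speed up exist()/value() queries considerably:
CREATE PRIMARY XML INDEX PXML_Documents_Doc ON Documents (Doc);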

However, for frequent queries you can use triggers or materialized views to create copies of the relevant data in table format, so that more intensive reports can be sped up by querying the breakout tables.

I don't know your requirements, but if you are responsible for creating the reports/queries yourself, this may be an approach worth using. If you need to enable users to create their own reports, that's a bigger mountain to climb.

I guess what I am saying is: "Are you sure you need to do this, and that XML can't just do the job?"

Ben
  • this makes a lot of sense. How would you handle updating the model schema and maintaining valid data state with this approach? – smartcaveman Mar 15 '11 at 13:00
  • thanks - I am probably going to need better performance than the SQLXML allows for. I'm currently looking into some different open source solutions for either a document database optimized for queries or something similar to what you initially suggested. – smartcaveman Mar 16 '11 at 10:24

In part, it will depend on your DB engine. You're using SQL Server, aren't you?

Answering your points:

1 - Comparing the value of a text field versus that of a varchar field: if you're comparing two DB fields, varchar fields perform better. nvarchar(max) stores data in Unicode using 2*l+2 bytes, where "l" is the length. For performance, you will need to consider how large the tables will get when deciding how (or whether) to index your table fields.

2 - Sometimes nested queries are easy to create and execute, and can also reduce query time. But depending on the complexity, it may be better to use some kind of join. The best way is to try it both ways. Execute each query two or more times, since the DB engine compiles a query on first execution and subsequent runs are considerably faster. Measure the times for different parameters and choose the best option.

"Sometimes you can rewrite a subquery to use JOIN and achieve better performance. The advantage of creating a JOIN is that you can evaluate tables in a different order from that defined by the query. The advantage of using a subquery is that it is frequently not necessary to scan all rows from the subquery to evaluate the subquery expression. For example, an EXISTS subquery can return TRUE upon seeing the first qualifying row." - link

3 - There's not much information to go on here, but if you can get the XML document directly from the table, that would be a better idea than a view. Again, it will depend on the view and the document.

4 - Other issues concern the total number of records expected in your table, and the indexing of the columns, for which you need to consider sorting, joining, filtering, PKs and FKs. Each situation can demand a different approach. My suggestion is to invest some time reading about how your database engine and its queries work in relation to your system.

I hope I've helped.

Alex
  • About 1) - The issue is not so much the size, but the fact that text fields have variable size in the DB, and that makes it harder for the DB to find stuff in the table (simply put). That's why you should have them in a separate table that you don't use for "normal" queries. And although some databases offer some indexing for full-text searches, you're probably much better off using a dedicated full-text search engine for such a task. – averell Feb 24 '11 at 08:56

Interesting question.

I think you may be asking the wrong question here. Broadly speaking, as long as you have a FULLTEXT index on your text field, queries will be fast; much faster than varchar if you have to use wildcards, for instance.
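
For example, a sketch of a full-text setup on your proposed Values table (SQL Server syntax; the surrogate ValueId key is hypothetical, needed because a full-text KEY INDEX must be a single-column unique index):

ALTER TABLE [Values] ADD ValueId int IDENTITY(1,1) NOT NULL;
CREATE UNIQUE INDEX UQ_Values_ValueId ON [Values] (ValueId);

CREATE FULLTEXT CATALOG DocumentCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON [Values] (Value) KEY INDEX UQ_Values_ValueId;

-- CONTAINS can use the full-text index; LIKE '%...%' cannot.
SELECT DocumentId, ValueXPath
FROM [Values]
WHERE CONTAINS(Value, N'Mulberry');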

However, if I were you, I'd concentrate on the actual queries you're going to be running. Do you need boolean operators? Wildcards? Numerical comparisons? That's where I think you will encounter the real performance worries.

I would imagine you would need queries like:

  • "find all addresses in the states of New York, New Jersey and Pennsylvania"
  • "find all addresses between house numbers 1 and 100 on Mulberry Street"
  • "find all addresses where the zipcode is missing, and the city is New York"

At a high level, the solution you propose is to store your XML somewhere, and then de-normalize that XML into name/value pairs for querying.

Name/value pairs have a long and proud history, but become unwieldy in complex query situations, because you're not using the built-in optimizations and concepts of the relational database model.

One refinement I'd recommend is to look at the domain model and at least see if you can factor the "value" column out into separate data types; you might end up with "textValue", "moneyValue", "integerValue" and "dateValue". In the example you give, you might factor Line1 into "houseNumber" (as an integer) and "streetName".
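
A sketch of what that might look like (the table, column names, and the /Address/HouseNumber path are all hypothetical):

CREATE TABLE TypedValues (
    DocumentId   uniqueidentifier NOT NULL,
    ValueXPath   varchar(250)     NOT NULL,
    TextValue    nvarchar(4000)   NULL,
    IntegerValue int              NULL,
    MoneyValue   money            NULL,
    DateValue    datetime         NULL,
    PRIMARY KEY (DocumentId, ValueXPath)
);

-- "house numbers between 1 and 100" becomes a native range predicate:
SELECT DocumentId
FROM TypedValues
WHERE ValueXPath = '/Address/HouseNumber'
  AND IntegerValue BETWEEN 1 AND 100;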

Having said all this, I don't think there's a better solution, short of completely changing tack to a document-focused database.

Neville Kuyt