
I've recently started pushing for TDD where I work. So far things are going well. We're writing tests, we're having them run automatically on commit, and we're always looking to improve our process and tools.

One thing I've identified that could be improved is how we set up our test data. In our unit tests, we often find ourselves instantiating and populating complex CLR objects by hand. This is a pain, and it typically means each test only covers a handful of cases.

What I'd like to push for is Data Driven tests. I think we should be able to load our test data from files, or maybe even generate it on the fly from a schema (though I would only consider on-the-fly generation if I could generate every possible configuration of an object, and that number of configurations was small). And therein lies my problem.

I have yet to find a good strategy for generating test data for C# CLR objects.

I looked into generating XML data from XSDs and then loading it into my tests using the DataSourceAttribute. This seemed like a good approach, but I ran into trouble generating the XSD files. xsd.exe falls over because our classes have interface members. I also tried running svcutil.exe on our assembly, but because our code is monolithic the output is huge and unwieldy (many interdependent .xsd files).
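
To illustrate the kind of data-driven loading I have in mind, here is a minimal, framework-agnostic sketch using LINQ to XML. The DTO and element names (PersonDto, Case, Name, Age) are invented for illustration; the real objects are much more complex:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical DTO; stands in for one of our real CLR objects.
public class PersonDto
{
    public string Name { get; set; }
    public int Age { get; set; }
}

public static class XmlTestData
{
    // Each <Case> element in the static XML file is one test configuration.
    public static PersonDto[] Load(string xml) =>
        XDocument.Parse(xml)
                 .Descendants("Case")
                 .Select(c => new PersonDto
                 {
                     Name = (string)c.Element("Name"),
                     Age = (int)c.Element("Age")
                 })
                 .ToArray();
}
```

Each loaded case would then be fed through the code under test, either in a loop or via a data-driven test attribute.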

What other techniques are there for generating test data? Ideally the generator would follow a schema (maybe an xsd, but preferably the class itself), and could be scripted. Technical notes (not sure if this is even relevant, but it can't hurt):

  • We're using Visual Studio's unit testing framework (defined in Microsoft.VisualStudio.TestTools.UnitTesting).
  • We're using RhinoMocks

Thanks

Extra Info

One reason I'm interested in this is to test an Adapter class we have. It takes a complex and convoluted legacy Entity and converts it to a DTO. The legacy Entity is a total mess of spaghetti and cannot be easily split up into logical sub-units defined by interfaces (as suggested). That would be a nice approach, but we don't have that luxury.

I would like to be able to generate a large number of configurations of this legacy Entity and run them through the adapter. The larger the number of configurations, the more likely my test will fail when the next developer (oblivious to 90% of the application) changes the schema of the legacy Entity.

UPDATE

Just to clarify, I am not looking to generate random data for each execution of my tests. I want to be able to generate data to cover multiple configurations of complex objects. I want to generate this data offline and store it as static input for my tests.

I just reread my question and noticed that I had in fact originally asked for random, on-the-fly generation. I'm surprised I asked for that! I've updated the question to fix it. Sorry about the confusion.

MetaFight
  • In what ways is it a 'legacy' entity? – JonnyRaa Dec 20 '13 at 14:05
  • The entity is laced with complex and contradictory business logic. This logic is spread out across around 50 properties. The order in which properties are accessed can affect the values they contain. It's awful. Because it's so prone to bugs (and because the company I'm working at only hires junior devs), new developers frequently change its inner workings. And we have no automated tests running on the legacy code... – MetaFight Dec 20 '13 at 14:13
  • I'm working on an integration project that relies on the legacy application. We're updating the legacy app to publish business-relevant events. I want the integration code to be as resistant to regressions as possible, so I managed to convince management to let me do TDD and to invest in CI. That means that the integration code is safer, but the rest of the legacy application is still a nightmare. – MetaFight Dec 20 '13 at 14:14
  • 1
    From your updates it sounds like essentially one part of your application is pretty nasty and you are trying to hide it behind an interface. What makes it difficult to set up this 'entity'? Does it have database dependencies? Or does it depend on other external resources that mean you can't configure/use it easily from tests? – JonnyRaa Dec 20 '13 at 16:03

5 Answers


What you need is a tool such as NBuilder (http://code.google.com/p/nbuilder).

This allows you to describe objects, then generate them. This is great for unit testing.

Here is a very simple example (but you can make it as complex as you want):

var products = Builder<Product>
                   .CreateListOfSize(10)
                   .All().With(x => x.Title = "some title")
                   .And(x => x.AnyProperty = RandomlyGeneratedValue())
                   .And(x => x.AnyOtherProperty = OtherRandomlyGeneratedValue())
                   .Build();
Roy Dictus

In my experience, what you're looking to accomplish ends up actually being harder to implement and maintain than generating objects in code on a test-by-test basis.

I worked with a client that had a similar issue, and they ended up storing their objects as JSON and deserializing them, with the expectation that it would be easier to maintain and extend. It wasn't. You know what you don't get when editing JSON? Compile-time syntax checking. They just ended up with tests breaking because of JSON that failed to deserialize due to syntax errors.

One thing you can do to reduce your pain is to code to small interfaces. If you have a giant object with a ton of properties, a given method that you'd like to test will probably only need a handful. So instead of your method taking SomeGiantClass, have it take a class that implements ITinySubset. Working with the smaller subset will make it much more obvious what things need to be populated in order for your test to have any validity.
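
To sketch that idea (all type names here are invented), the method under test depends only on the narrow interface, and a thin wrapper adapts the giant class to it:

```csharp
// The giant legacy class exposes far more than any one method needs.
public class SomeGiantClass
{
    public string Name { get; set; }
    public int Age { get; set; }
    // ...dozens of other properties...
}

// The method under test declares only the slice it actually reads.
public interface ITinySubset
{
    string Name { get; }
    int Age { get; }
}

// Hypothetical wrapper exposing the giant class through the small interface.
public class GiantAdapter : ITinySubset
{
    private readonly SomeGiantClass _inner;
    public GiantAdapter(SomeGiantClass inner) { _inner = inner; }
    public string Name => _inner.Name;
    public int Age => _inner.Age;
}

public static class Greeter
{
    // A test of this method only needs to fake two properties, not fifty.
    public static string Describe(ITinySubset s) => s.Name + " (" + s.Age + ")";
}
```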

I agree with the other folks who have said that generating random data is a bad idea. I'd say it's a really bad idea. The goal of unit testing is repeatability, which goes zooming out the window the second you generate random data. It's a bad idea even if you're generating the data "offline" and then feeding it in. You have no guarantee that the test object that you generated is actually testing anything worthwhile that's not covered in other tests, or if it's testing valid conditions.

More tests don't mean that your code is better, and 100% code coverage doesn't mean that your code is bug-free and working properly. You should aim to test the logic that you know matters to your application, not try to cover every single imaginable case.

Daniel Mann

This is a little different than what you are talking about, but have you looked at Pex? Pex will attempt to generate inputs that cover all of the paths of your code.

http://research.microsoft.com/en-us/projects/Pex/

mmilleruva

Generating test data is often an inappropriate and not very useful way of testing, particularly if you are generating a different set of test data (e.g. randomly) each time: sometimes a test run will fail and sometimes it won't. The generated data may also be totally irrelevant to what you're doing, and will make for a confusing group of tests.

Tests are supposed to help document and formalise the specification of a piece of software. If the boundaries of the software are found by bombarding the system with data, then they won't be documented properly. Tests also provide a way of communicating through code that is different from the code itself, and as a result they are often most useful when they are very specific and easy to read and understand.

That said, if you really want to do it, you can typically write your own generator as a test class. I've done this a few times in the past and it works nicely, with the added bonus that you can see exactly what it's doing. You also already know the constraints of the data, so there's no problem trying to generalise an approach.
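
As a rough sketch of what such a generator can look like (the entity and its properties are invented for illustration), enumerating the cartesian product of a few interesting values per property gives deterministic, repeatable coverage:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the real legacy class.
public class LegacyEntity
{
    public string Status { get; set; }
    public bool IsArchived { get; set; }
}

public static class EntityConfigurations
{
    // Deterministically enumerate every combination of the
    // "interesting" values chosen for each property.
    public static IEnumerable<LegacyEntity> All()
    {
        var statuses = new[] { "New", "Open", "Closed" };
        var archivedFlags = new[] { false, true };

        return from status in statuses
               from archived in archivedFlags
               select new LegacyEntity { Status = status, IsArchived = archived };
    }
}
```

Because the inputs are enumerated rather than randomised, every run exercises exactly the same configurations.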

From what you say, the pain you are having is in setting up objects. This is a common testing issue, and I'd suggest focusing on it by making fluent builders for your common object types. These give you a nice way of filling in less detail every time: you typically provide only the data that is interesting for a given test case, and have valid defaults for everything else. They also reduce the number of dependencies on constructors in test code, which means your tests are less likely to get in the way of refactoring later on. You can really get a lot of mileage out of that approach, and you can extend it further with common setup code for the builders once you have a lot of them; that becomes a natural point for developers to hang reusable code.
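
A minimal sketch of such a builder (the class and its defaults are invented): tests override only the values they care about and inherit valid defaults for everything else.

```csharp
// Hypothetical object under test.
public class Order
{
    public string Customer { get; set; }
    public decimal Total { get; set; }
}

public class OrderBuilder
{
    // Valid defaults, so tests only state what matters to them.
    private string _customer = "default-customer";
    private decimal _total = 10m;

    public OrderBuilder WithCustomer(string customer) { _customer = customer; return this; }
    public OrderBuilder WithTotal(decimal total) { _total = total; return this; }

    public Order Build() => new Order { Customer = _customer, Total = _total };
}
```

A call like new OrderBuilder().WithTotal(0m).Build() then reads as "an otherwise valid order with a zero total".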

In one system I've worked on, we ended up aggregating all these sorts of things into something which could switch different seams in the application (file access etc.) on and off, provided builders for objects, and set up a comprehensive set of fake view classes (for WPF) to sit on top of our presenters. It effectively provided a test-friendly interface for scripting and testing the entire application, from very high-level things to very low-level things. Once you get there you're really in the sweet spot: you can write tests that effectively mirror button clicks in the application at a very high level, yet the code stays easy to refactor because there are few direct dependencies on your real classes in the tests.

JonnyRaa
  • Thanks for your input. Just to clarify: I don't want the test data to be generated randomly on the fly. I want to be able to generate this data offline and set it as a test case. The reason I want an offline generator (or at least to define the data offline) is that the test I want to run is for an *Adapter*. It takes a complex and convoluted legacy Entity and converts it to a simpler DTO. I want my test to fail when the schema of the legacy Entity changes. That way I can double-check that the adapter still works correctly. So, in summary, I want a deterministic, schema-aware value generator. – MetaFight Dec 20 '13 at 13:23
  • Sort of just sounds like a smoke test for loading some information from an old format? Don't you have any data kicking about in the old format? 'Entity' sounds like you're talking about a database – could you provide some more specifics about what you are doing? – JonnyRaa Dec 20 '13 at 13:51
  • I've updated the question. This is not about migrating an old format to a new format. This is part of an integration project between a brittle legacy system (with the legacy Entity) and a new system. We're updating the legacy system to broadcast business-relevant events. I'm trying to write a suite of tests that will help make the integration more regression resistant to future developers. The adapter is there to lessen the tight coupling with a class that changes frequently. This means I need good tests on the Adapter. – MetaFight Dec 20 '13 at 14:08

Actually, there is a Microsoft way of expressing object instances in markup, and that is XAML.

Don't be scared by the WPF paradigm in the documentation. All you need to do is use the correct classes in your unit tests to load the objects.

Why would I do this? Because a Visual Studio project will automatically give you XAML syntax highlighting and probably IntelliSense support when you add such a file.

What would be a small problem? Markup element classes must have parameterless constructors. But that problem is always present, and there are workarounds (e.g. here).

I wish I could show you something I've done on this matter, but I can't.

Tengiz