
I'm building a complex planning app (covering articles, sales, clients, manufacturing, machines...) that uses the information provided by an ERP's SQL Server database.

I use about 30 different related objects, each of which has its info stored in a table/view. Some of these tables have from 20k to 100k records.

I need to convert all these tables into C# objects for further processing that cannot be done in SQL. I don't need all the rows, but there isn't a way to determine exactly which ones I will need, as this depends on runtime events.

The question is about the best way to do this. I have tried the following approaches:

  1. Retrieve all the data and store it in a DataSet using a SqlDataAdapter, which occupies about 300 MB of RAM. First problem here: sync, but that's acceptable since the data isn't going to change much during execution.

    Then I run through every row and convert it to a C# object, stored in static Dictionaries for fast access by key. The problem is that creating so many objects (millions) pushes memory usage up to 1.4 GB, which is too much. Memory aside, data access is very fast. A rough sketch of this approach follows.
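A minimal sketch of that first approach, assuming an `Articulo` class like the one in the snippet further down (the table, column, and connection-string names here are illustrative):

    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    public static class CargaMasiva
    {
        // Loads the whole table once and materializes every row as an object.
        public static Dictionary<string, Articulo> CargarArticulos(string cadenaConexion)
        {
            var tabla = new DataTable();
            using (var adaptador = new SqlDataAdapter(
                "SELECT IdArticulo, Descripcion FROM articulos", cadenaConexion))
            {
                adaptador.Fill(tabla); // one round trip; everything ends up in RAM
            }

            var articulos = new Dictionary<string, Articulo>(tabla.Rows.Count);
            foreach (DataRow fila in tabla.Rows)
            {
                string id = fila["IdArticulo"].ToString().ToUpper();
                var articulo = new Articulo(id);
                articulo.descripcion = fila["Descripcion"].ToString();
                articulos[id] = articulo;
            }
            return articulos; // lookups are instant, but millions of objects stay alive
        }
    }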

Since loading everything takes too much memory, I figured I needed some kind of lazy loading, so I tried:

  2. Query the database directly through a SqlDataReader, filtering by the item I need, only the first time it's required; the result is then stored in the static dictionary (see the sketch after this item). This keeps memory usage to a minimum, but it is slow (on the order of minutes), since it means making around a million separate queries, which the server doesn't seem to like (poor performance).
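A minimal sketch of this second approach, again with illustrative names; each cache miss pays a full round trip to the server, which is what makes a million lookups so slow:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    public static class CargaPerezosa
    {
        private static readonly Dictionary<string, Articulo> cache =
            new Dictionary<string, Articulo>();

        public static Articulo Obtener(string idArticulo, string cadenaConexion)
        {
            idArticulo = idArticulo.ToUpper();
            Articulo articulo;
            if (cache.TryGetValue(idArticulo, out articulo))
                return articulo; // already materialized, no query needed

            using (var conexion = new SqlConnection(cadenaConexion))
            using (var comando = new SqlCommand(
                "SELECT Descripcion FROM articulos WHERE IdArticulo = @id", conexion))
            {
                comando.Parameters.AddWithValue("@id", idArticulo);
                conexion.Open();
                using (var lector = comando.ExecuteReader())
                {
                    if (lector.Read())
                    {
                        articulo = new Articulo(idArticulo);
                        articulo.descripcion = lector["Descripcion"].ToString();
                        cache.Add(idArticulo, articulo);
                    }
                }
            }
            return articulo; // null when the id does not exist
        }
    }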

Lastly, I tried an intermediate approach that kind of works, but I suspect it isn't optimal:

  3. Fill a DataSet containing all the info and keep a static local copy, but don't convert all the rows to objects up front; do it on demand (lazily), something like this:

    public class ProductoTerminado : Articulo
    {
        private static Dictionary<string, ProductoTerminado> productosTerminados =
            new Dictionary<string, ProductoTerminado>();

        public PinturaTipo pinturaTipo { get; set; }

        public ProductoTerminado(string id)
            : base(id) { }

        public static ProductoTerminado Obtener(string idArticulo)
        {
            idArticulo = idArticulo.ToUpper();
            ProductoTerminado productoTerminado;
            if (productosTerminados.TryGetValue(idArticulo, out productoTerminado))
            {
                return productoTerminado;
            }
            productoTerminado = new ProductoTerminado(idArticulo);
            // This is where I get new data from that static DataSet.
            var fila = Datos.bd.Tables["articulos"]
                               .Select("IdArticulo = '" + idArticulo + "'")
                               .First();
            // Then I fill the object and add it to the dictionary.
            productoTerminado.descripcion = fila["Descripcion"].ToString();
            productoTerminado.paletizacion = Convert.ToInt32(fila["CantidadBulto"]);
            productoTerminado.pinturaTipo = PinturaTipo.Obtener(fila["PT"].ToString());
            productosTerminados.Add(idArticulo, productoTerminado);
            return productoTerminado;
        }
    }
    

So, is this a good way to proceed or should I look into Entity Framework or something like a strongly typed DataSet?
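For reference, here is a minimal sketch of what that Entity Framework route could look like (EF6 here, and the model is hypothetical; all names are illustrative). `DbSet<T>.Find` checks the context's first-level cache before querying the database, so a long-lived context would behave much like my lazy dictionary above:

    using System.ComponentModel.DataAnnotations;
    using System.ComponentModel.DataAnnotations.Schema;
    using System.Data.Entity;

    [Table("articulos")]
    public class ArticuloEntidad
    {
        [Key]
        public string IdArticulo { get; set; }
        public string Descripcion { get; set; }
    }

    public class ErpContext : DbContext
    {
        public DbSet<ArticuloEntidad> Articulos { get; set; }
    }

    public static class Repositorio
    {
        // One context for the app's lifetime: acceptable here only because
        // the data is read-mostly and going out of sync is not an issue.
        private static readonly ErpContext contexto = new ErpContext();

        public static ArticuloEntidad Obtener(string idArticulo)
        {
            // The first call for a key queries the database; later calls are
            // served from the context's identity map without a round trip.
            return contexto.Articulos.Find(idArticulo.ToUpper());
        }
    }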

Pinx0
  • If you have a lot of data perhaps you should leave it in the database (option 2) instead of trying to create a local in-memory database inside your app. When your app starts up, surely the user doesn't need to see 1.4 GB of data straight away? What is on the primary data-aware screen of your app? No one can make any sense of 1.4 GB of data in a grid. They usually need to analyse it first. Perhaps you could explain further. – Nick.McDermaid Apr 29 '15 at 11:20
  • @Nick.McDermaid I need to get and relate all that data to process it in a way I can't do through SQL (temporal simulation, heuristic algorithms); the result of this processing, along with part of the information from the database, is shown to the user in a variety of screens. – Pinx0 Apr 29 '15 at 11:22
  • You don't say how your application will be used: are there many people accessing the data simultaneously, or just one or two? Is the app used remotely from the location of your database, or on the same network? In option 2 you say the program is slow, but can you quantify what you mean by that? Are queries taking seconds, minutes? Without additional information it's hard to advise. – Tony Apr 29 '15 at 11:22
  • @Tony It can be used by two people at the same time, but the information the app provides is almost read-only, so staying out of sync isn't much of an issue. The database is on the same local network. By slow I mean that the million or so queries the initialization requires take several minutes (and probably cause performance problems for the rest of the company, which uses the same database). – Pinx0 Apr 29 '15 at 11:25
  • @Pinx0 - If you are doing a complex simulation I would say your users are being unreasonable to expect results instantly. What's wrong with it taking some time? But more importantly I see you said "...process it in a way I cannot do it through SQL.." which leads me to ask, why are you using SQL to store your data? Have you explored other options? – Tony Apr 29 '15 at 11:31
  • @Tony But it isn't the simulation that takes long, it's the data querying... and they have seen how fast it can get by using option 1; the problem is memory usage as the app gets more complex. I use SQL because, as stated in the OP, the information that feeds the app comes from an ERP using SQL Server. – Pinx0 Apr 29 '15 at 11:33
  • @Pinx0 - Sorry, missed the part about the data feed. If you can't process any of the data in the SQL server you should probably remove the sql-server tag from your question. Last questions before I run out of ideas... Why do you say 1.4GB of RAM is too much to use? If you are using a 32 bit process I see your problem, but you can use [IMAGE_FILE_LARGE_ADDRESS_AWARE](https://msdn.microsoft.com/en-us/library/aa366778.aspx) to get more accessible RAM. And machines with 8GB are cheap :) – Tony Apr 29 '15 at 11:44
  • @Tony If only I could decide on that... but I'm limited to 2 GB PCs, so 1.4 GB just for this app makes everything else practically unusable. Anyway, the question was more about which option is the best way to proceed. – Pinx0 Apr 29 '15 at 11:46
  • Do you have a single query result that gives back all the data? Is this ERP system SAP? – danish Apr 29 '15 at 12:12
  • I'm biased, but... are you certain you can't do it through SQL? SQL is good at crunching data but not so good at processing strings. – Nick.McDermaid Apr 29 '15 at 12:25
  • @danish No, not all of it. Maybe I could make some joins and reduce the number of queries, but that would give me a lot of repeated fields, increasing the size of the data. – Pinx0 Apr 29 '15 at 13:37
  • @Nick.McDermaid I'm sure. The processing I do through sequential algorithms once I have the data in objects has no equal in SQL, at least not without being a SQL guru. – Pinx0 Apr 29 '15 at 13:38
  • So far I'm not convinced. Is it set-based? Are you looking up, cross-referencing and aggregating data? When you only use a hammer everything looks like a nail (of course it goes both ways; I like to think you can do most stuff in SQL instead of a custom-built application layer). – Nick.McDermaid Apr 30 '15 at 00:43
  • You've pigeonholed yourself here. Make a new question describing what you're trying to do with the data. – Cory Nelson Apr 30 '15 at 02:43
  • @Nick.McDermaid It's not just looking up and aggregating... it's a custom-made MPS and MRP generator; it also connects with R.NET for statistical calculations. That's not really the point of this question; that part works. I only need to convert all that data into C# objects, but only the data I need, which I don't know exactly beforehand (so I can't just put a WHERE clause). – Pinx0 Apr 30 '15 at 07:13
  • Well, so you may want to rethink your queries, maybe with a CLR-based procedure if needed. You may even have to rethink the way you have modelled objects and the way you are manipulating them. This looks like a design exercise. Given the information, there is no concrete answer to this question. IMHO, looping through a reader with yield is rather quick. – danish Apr 30 '15 at 10:38
  • @danish Well, that's where I struggle. I haven't studied design; I've learnt online and by trial and error. Take a look at the edit I made, as an example of a class I use. In the get method, I check whether that object (key) has already been created and, if not, I get it from the data. The example is from "solution" number 3: the data is stored locally in a DataSet, but the objects are created only on demand. For "solution" 2 I would just have to replace that line and include the SqlReader. The question is whether that's well done or I'm pointing the wrong way. – Pinx0 Apr 30 '15 at 11:11
  • Too broad. Data processing performance depends on way too many factors to have broad-brush answers. – Adam Houldsworth Apr 30 '15 at 11:13
  • How about just wrapping EF on top of SQL Server and taking advantage of its first-level cache by having a long-lived DB context, since you stated "staying out of sync is not an issue"? – Janne Matikainen Apr 30 '15 at 11:17
  • @JanneMatikainen That looks promising; it says I can get the data as strongly typed objects. Interesting, I will look further into it. – Pinx0 Apr 30 '15 at 11:22
  • Well, assuming that you're doing matrix calculations, this guy says don't bother... so I withdraw my suggestion: https://www.simple-talk.com/sql/t-sql-programming/matrix-math-in-sql/ – Nick.McDermaid Apr 30 '15 at 11:37
  • Edited to clarify the purpose. – Pinx0 Apr 30 '15 at 13:05
  • `I need to convert all these tables into C# objects for further processing that cannot be done in SQL. I don't need all the rows, but there isn't a way to determine exactly which ones I will need, as this depends on runtime events.` You seem to be approaching the problem the wrong way. Even if you have an existing database, try thinking from the business/domain perspective, and then mapping those domain objects to the database tables. What you probably need is domain-driven design. – publicgk May 07 '15 at 05:17

1 Answer


I use relations between about 30 different objects, each of which has its info stored in a table/view. Some of these tables have from 20k to 100k records.

I suggest making a different decision for different types of objects. Usually, the tables that have thousands of records are more likely to change; tables with fewer records are less likely to. In a project I worked on, the decision was to cache the objects that don't change in a List<T> on start-up. For a few hundred instances this should take well under a second (see the sketch below).
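A minimal sketch of that start-up cache. I'm borrowing the PinturaTipo name from the question, but its real shape is unknown, so the table, columns, and properties here are assumptions:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    public class PinturaTipo
    {
        public string Id { get; set; }
        public string Nombre { get; set; }
    }

    public static class CacheEstatica
    {
        public static List<PinturaTipo> PinturaTipos { get; private set; }

        // Called once at start-up; a few hundred rows load in well under a second.
        public static void Inicializar(string cadenaConexion)
        {
            var tipos = new List<PinturaTipo>();
            using (var conexion = new SqlConnection(cadenaConexion))
            using (var comando = new SqlCommand(
                "SELECT IdTipo, Nombre FROM pintura_tipos", conexion))
            {
                conexion.Open();
                using (var lector = comando.ExecuteReader())
                {
                    while (lector.Read())
                    {
                        tipos.Add(new PinturaTipo
                        {
                            Id = lector["IdTipo"].ToString(),
                            Nombre = lector["Nombre"].ToString()
                        });
                    }
                }
            }
            PinturaTipos = tipos;
        }
    }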

If you are using LINQ to SQL, have the object local in a List<T>, and have correctly defined the FK associations, you can do obj.Items to access the Items table filtered by obj's ID (here obj is a row in the parent PK table and Items is the child FK table). A hedged sketch follows.
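A sketch of such a LINQ to SQL association, with hypothetical Cliente/Pedido tables (all names are illustrative). Enumerating cliente.Pedidos issues a query filtered by the client's key, so only the rows you touch are loaded:

    using System.Data.Linq;
    using System.Data.Linq.Mapping;
    using System.Linq;

    [Table(Name = "clientes")]
    public class Cliente
    {
        [Column(IsPrimaryKey = true)]
        public string IdCliente { get; set; }

        private EntitySet<Pedido> pedidos = new EntitySet<Pedido>();

        // FK association: access is translated into a WHERE on IdCliente.
        [Association(Storage = "pedidos", OtherKey = "IdCliente")]
        public EntitySet<Pedido> Pedidos
        {
            get { return pedidos; }
            set { pedidos.Assign(value); }
        }
    }

    [Table(Name = "pedidos")]
    public class Pedido
    {
        [Column(IsPrimaryKey = true)]
        public int IdPedido { get; set; }

        [Column]
        public string IdCliente { get; set; }
    }

    public static class Ejemplo
    {
        public static void Usar(string cadenaConexion)
        {
            using (var db = new DataContext(cadenaConexion))
            {
                Cliente cliente = db.GetTable<Cliente>()
                                    .Single(c => c.IdCliente == "C001");
                // Only this client's orders are fetched, on first enumeration.
                foreach (Pedido pedido in cliente.Pedidos)
                {
                    // process the order...
                }
            }
        }
    }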

This design will also give users the performance they expect. When working on small sets, everything is instantaneous (cached). When working on larger sets but making small selects or inserts, performance is good (quick queries that use the PK). You only really suffer when you start doing queries that join multiple big tables, and in those cases users will probably expect some delay (though I can't be certain without knowing more about the use case).

ytoledano