8

Primarily this seems to be a technique used by games, where all the sounds go in one file, all the textures in another, and so on, with these files commonly reaching gigabytes in size.

What is the reason for doing this rather than keeping everything in subdirectories as small files (one per texture, say)? Many small games take the small-file approach, while the monolithic system seems to be favoured by larger companies.

Is there some file system overhead with lots of small files? Or are they trying to protect their assets, even though most of these archives just seem to be compressed files with a new extension?

Ali Lown

5 Answers

7

The reasons we use an "archive" system like this where I work (a game development company):

  • lookup speed: We rarely need to iterate over files in a directory; we're far more often looking them up directly by name. By using a custom "file allocation table" that is essentially just a sequence of hash( normalized_filename ) -> [ offset, size ], we can look up files very quickly. We can also keep this index in RAM, potentially interleave it with other index tables, etc. (There's a rough sketch of this kind of index just after this list.)
  • (When we do need to iterate, we can either easily iterate over all files in a .arc, or we can store a list of filenames, a list of hash-of-filenames, or just a list of [ offset, size ] pairs somewhere -- maybe even as a file in the archive. This is usually faster than directory-traversal on a FS.)
  • metadata: It's easy for us to tuck in any file metadata we want. For example, a single bit in the "size" field indicates whether the file is compressed or not (if it is, it has a header with more details about how to decompress it). We can even vary compression on pieces of a file if we know enough about the structure of the file ahead of time (we do this for sprite archives).
  • size: One of the devices we use has a "file size must be a multiple of X" requirement, where X is large compared to some of our files. For example, some of our lua scripts end up being just a few hundred bytes when compiled; taking extra overhead per .luc file adds up quickly.
  • alignment: on the other hand, sometimes we want to waste space. To take advantage of faster streaming (e.g. background DMA) from the filesystem, some of our files do want to obey certain alignment/size requirements. We can take care of that right in the tool, and the align/size we're shooting for doesn't necessarily have to line up with the underlying FS, allowing us to waste space only where we need it.
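A minimal sketch of that kind of FAT-style index: the hash of a normalized filename maps to an offset/size pair, and the whole table lives in RAM. The hash choice (FNV-1a), field layout, and class names below are placeholders of mine, not the actual format described above.

```cpp
// Sketch of a FAT-style archive index: hash(normalized_filename) -> { offset, size }.
// The hash choice (FNV-1a), field layout, and names are illustrative placeholders.
#include <cctype>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

struct ArcEntry {
    uint64_t offset;  // byte offset of the file's data within the .arc
    uint64_t size;    // payload size; a spare bit could flag "compressed"
};

static uint64_t HashName(std::string name) {
    // Normalize so "Foo\Bar.TGA" and "foo/bar.tga" hash identically.
    for (char& c : name) {
        if (c == '\\') c = '/';
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    uint64_t h = 1469598103934665603ull;  // FNV-1a, 64-bit
    for (unsigned char c : name) { h ^= c; h *= 1099511628211ull; }
    return h;
}

class ArcIndex {
public:
    void Add(const std::string& name, ArcEntry e) { table_[HashName(name)] = e; }
    std::optional<ArcEntry> Find(const std::string& name) const {
        auto it = table_.find(HashName(name));
        if (it == table_.end()) return std::nullopt;
        return it->second;
    }
private:
    std::unordered_map<uint64_t, ArcEntry> table_;  // small enough to keep entirely in RAM
};
```

Looking up a file is then one hash plus one table probe, with no directory traversal involved.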

But those are the mundane reasons. The more fun stuff:

Each .arc registers itself in a list, and any attempt to open a file knows to look in the .arcs. We search already-in-RAM archives first, then archives on the device FS, then the actual device FS. This gives us a ton of flexibility (there's a rough sketch of this layered lookup after the list below):

  • dynamic additions to the filesystem: at any time we can stream a new file or archive to the machine in question (over the network or the like) and have it appear as part of the "logical" filesystem; this is great when the actual FS resides in ROM or on a CD, and allows us to iterate much more quickly than we could otherwise.
  • (Doom's .wad system is a sort of example of the above, which allows modders to more easily override assets and scripts built into the game.)
  • possibility of no underlying fs: It's possible to use bin2obj to embed an entire arc directly in the executable (.rodata) at link time, at which point you don't ever need to look at the device FS -- we do this for certain small demo builds and the like. We can also send levels across the network or savegame-sneakernet this way. =)
  • organization and load/unload: since we can load and unload and override virtual "pieces" of our filesystem at any time, we can do some performance tricks with having the number of files in the FS be very small at any given time. We can additionally specify that an entire archive be loaded into memory, index table and data; our file load code is smart enough to know that if the file is already in memory, it doesn't need to do anything to read it other than move a pointer around. Some of the higher level code can actually detect that the file is in ram and just ask for the probably-already-looks-like-a-struct pointer directly.
  • portability: we only need to figure out how to get a few files on each new device we use, and then the remainder of the FS code is more or less the same. =) We do change the tool output a bit occasionally (for alignment reasons), but most of the processing remains the same.
  • de-duplication: with smarter archives, such as our sprite archives, we can (and do) de-duplicate data. If "jump" animation's fifth frame and "kick"'s third frame are the same, we can pull apart the file and only store one copy of that frame. We can do the same for whole files.
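To make that search order concrete, here is a rough sketch of the layered lookup: in-RAM archives first, then archives sitting on the device FS, then the raw filesystem as a fallback. The Archive/Blob types and the VirtualFs name are hypothetical, and std::ifstream stands in for whatever the device FS layer actually is.

```cpp
// Sketch of a layered "virtual filesystem": mounted archives are searched before
// the real device FS. Types and interfaces here are hypothetical placeholders.
#include <cstdint>
#include <fstream>
#include <iterator>
#include <memory>
#include <optional>
#include <string>
#include <vector>

struct Blob { std::vector<uint8_t> bytes; };

class Archive {
public:
    virtual ~Archive() = default;
    virtual std::optional<Blob> Read(const std::string& name) const = 0;
    virtual bool InRam() const = 0;  // true if the whole archive is already resident
};

class VirtualFs {
public:
    void Mount(std::shared_ptr<Archive> arc) { archives_.push_back(std::move(arc)); }

    std::optional<Blob> Open(const std::string& name) const {
        // 1) already-in-RAM archives (cheapest: often little more than a pointer move)
        for (const auto& a : archives_)
            if (a->InRam())
                if (auto b = a->Read(name)) return b;
        // 2) archives that live on the device FS
        for (const auto& a : archives_)
            if (!a->InRam())
                if (auto b = a->Read(name)) return b;
        // 3) fall back to the actual device filesystem
        std::ifstream f(name, std::ios::binary);
        if (!f) return std::nullopt;
        return Blob{ std::vector<uint8_t>(std::istreambuf_iterator<char>(f),
                                          std::istreambuf_iterator<char>()) };
    }

private:
    std::vector<std::shared_ptr<Archive>> archives_;  // mounted .arcs, searched in order
};
```

Streaming a new archive over the network and mounting it is then all it takes for its contents to appear in the "logical" filesystem.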

We ported a PC game to a system with much slower FS access recently. We didn't change the data format, and it turns out iterating through a dir on the raw device FS to load a hundred small XML files was absolutely killing our load times. The solution we used was to take each dir, make it into its own subdir.arc, and stick it in the master game.arc compressed. When the dir was needed (something like opendir was called) we decompressed the entire subdir.arc into RAM, added it to the filesystem, then iterated through it super-quickly.

It's the ability to throw something like this together in a few hours, and to ease the pain of porting across systems, that makes stuff like this worthwhile.

leander
  • haha +1 i like this answer. I completely forgot about alignment, chunks and streaming pieces in. The other reasons are cool too. –  May 22 '10 at 23:20
  • +1 Good answer. Also custom archives allow you to set storage options per file data chunk i.e vary compression format, data alignment per resource type etc – zebrabox May 23 '10 at 08:07
  • @zebrabox: oh yeah, I forgot about that. In another (sprite-specific) archive format we use, we compress per-frame separately from the metadata, and we do automatic de-duplication for frames that end up being the same, even across different sprites. – leander May 23 '10 at 14:41
1

File systems do have an overhead. Usually, a file takes disk space rounded up to a multiple of the filesystem's allocation unit (commonly 4 KB), so many small files waste space. Some modern file systems try to mitigate that, but AFAIK it's not widespread yet. Additionally, file systems are often quite slow when accessing many files: it is usually considerably faster to copy one 400 MB file than 4000 files of 100 KB each.
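As a rough illustration of that rounding overhead (assuming a 4 KB allocation unit; real filesystems vary, and some can pack very small files inline):

```cpp
// Back-of-the-envelope slack calculation for small files on a 4 KB-cluster filesystem.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t cluster  = 4 * 1024;  // assumed allocation unit
    const uint64_t fileSize = 700;       // e.g. a small script or config file
    const uint64_t onDisk   = (fileSize + cluster - 1) / cluster * cluster;
    const uint64_t slack    = onDisk - fileSize;
    std::printf("a %llu-byte file occupies %llu bytes on disk (%llu wasted)\n",
                (unsigned long long)fileSize, (unsigned long long)onDisk,
                (unsigned long long)slack);
    std::printf("10000 such files waste about %.1f MB\n",
                10000 * slack / (1024.0 * 1024.0));
    return 0;
}
```

Packed back-to-back inside an archive, the same files pay none of that per-file slack.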

File systems come in handy when you have to modify files, because they handle changing file sizes much better than any simple home-grown solution. However, that's certainly not the case for constant game data.

doublep
  • Any decent textures are likely to be over the 4KB mark, sounds definitely will be. So a possible quicker install is the main reason? By how much I suppose depends on the file system. – Ali Lown May 22 '10 at 20:31
  • It doesn't matter if the texture is over the 4KB mark - you still waste space, up to the next multiple of 4KB. – Kylotan May 24 '10 at 09:28
0

On Apple systems, the most common way is to use, as you suggest, directories. They are called bundles; the Finder represents them as a single file, but if you explore them further they're actually directories. This makes it very easy to write code for, and to conserve memory while, loading individual items out of a bundle. :-) It also makes incremental backups of gigantic databases easy: your iPhoto library, for instance, is just a bundle, so you only back up the changed and new files.

On Windows, however, I believe this is much harder to do; a directory will look like a directory "no matter what". (I'm sure smart people have found ways to make Explorer treat certain directories as a single file, but it's not common.)

From a game developer's point of view, the files aren't so small that disk space overhead is much of a concern, so I doubt @doublep's suggestion is the reason, given the hassle it adds. A single file does make things much easier when users need to copy an entire game somewhere, though: it's easy to check that the whole set is intact.

And, of course, it's harder to read for people who shouldn't have access to it. But it's also harder to modify, which means harder to patch and harder to write extensions for. A game whose community uses extensions a lot prefers the directory structure: The Sims, for example.

Were I the games developer, I'd love to go for individual files. Then again, I'd be using bundles as I'd be writing for the Mac ;-)

Cheers

Nik

niklassaers
  • See certain directories as a single file: a zip file in 'store' mode (no compression). The interesting thing is that sometimes it is just one monolithic data file, yet has a complete directory-like structure inside. So there must still be some other reason to use one file, which is not to make it easy to copy to other computers (all those pesky registry keys they like to add). – Ali Lown May 22 '10 at 20:52
0

I can think of multiple reasons.

As doublep suggested, files occupy more space on the disc than they strictly require, so an archive saves space. With 4 KB clusters each file wastes about 2 KB on average, so 10k files (of any size) should save you roughly 20 MB when packed into an archive. Not exactly a large amount of space nowadays, but still.

The other reason I can think of is disc fragmentation. I suspect a heavily fragmented disc will perform worse when accessing thousands of separate files on a fragmented space. But I'm no expert in this field, so I'd appreciate if someone more experienced verified this.

Finally, I think this may also have something to do with restricting access to separate game files. You can have a bunch of Lua scripts exposed, mess with them and break something. Or you could have the outro cinematic/sound/text/whatever exposed and get spoiled by accessing it. I do that myself as well: I encrypt images with a multipass XOR key, pack text files and config variables into a monolithic file (zipped for extra security) and only leave music freely accessible. This way, the game's secrets will remain undiscovered for a bit longer :).
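A multipass XOR like the one mentioned is only light obfuscation, not real encryption, since the key ships with the game; a minimal sketch might look like this (key bytes and pass count are made-up examples):

```cpp
// Minimal sketch of multipass-XOR obfuscation; applying the same passes again
// restores the original bytes. The keys here are made-up example values.
#include <cstddef>
#include <cstdint>
#include <vector>

static void XorPass(std::vector<uint8_t>& data, const std::vector<uint8_t>& key) {
    for (size_t i = 0; i < data.size(); ++i)
        data[i] ^= key[i % key.size()];
}

void ToggleObfuscation(std::vector<uint8_t>& data) {
    const std::vector<uint8_t> keys[] = { {0x5A, 0x3C, 0x96}, {0xA7, 0x11} };
    for (const auto& k : keys) XorPass(data, k);  // XOR is its own inverse
}
```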

Or there may be another reason I never thought about :D.

mingos
  • I doubt disk fragmentation is the main reason (it's not a problem on Linux filesystems, and Vista onwards defragments automatically). Keeping secrets seems the most likely possibility, since it is favoured by the big publishers. – Ali Lown May 22 '10 at 21:28
  • Actually, from what I remember, keeping secrets had no part in it, since the data was encrypted on disk (several techniques) and hardware on the GC (and I believe the PS) decrypted it on the fly. You could do the same with small files, but the main reason is performance: depending on the type of game you can have hundreds of files for one level, which is horrible to load, even in code (I guess you could generate a list, but generating one file is pretty easy). –  May 22 '10 at 21:38
0

As you know, games, especially from larger companies, try to squeeze out as much performance as they can. One technique is to keep all the data in one large file and just DMA it into memory (think of it as a memcpy from CD to RAM). Since everything sits in one large file there are no disk seeks, so a large number of assets (which as separate files could cause a large number of seeks) all get loaded quickly.
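The gist is a single large sequential read into one preallocated buffer instead of hundreds of small opens and seeks; a minimal sketch, with standard file I/O standing in for a platform's DMA or streaming API:

```cpp
// Sketch: load a monolithic archive with one sequential read rather than opening
// and seeking through many small files. std::ifstream stands in for a platform
// DMA/streaming API here.
#include <cstdint>
#include <fstream>
#include <vector>

std::vector<uint8_t> LoadWholeArchive(const char* path) {
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    std::vector<uint8_t> buf;
    if (!f) return buf;
    buf.resize(static_cast<size_t>(f.tellg()));  // one allocation for the whole archive
    f.seekg(0);
    f.read(reinterpret_cast<char*>(buf.data()), static_cast<std::streamsize>(buf.size()));
    return buf;  // individual assets are then located via an index into this buffer
}
```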

  • Presumably this has only become a feasible option in the last few years: storing around 1 GB in RAM (about the amount Crysis takes from me) is only possible when you can guarantee that the user WILL have several GB of RAM, since as soon as it gets paged it is no longer worth doing. What about DMAing only part of a data file (is that possible, or worth it)? – Ali Lown May 22 '10 at 21:25
  • 1
    DMA = Direct Memory Access. It's a transfer mechanism, not a place for storage. In the early days almost everything used DMA since there was little speed to spare; this has always been done. But there are times when you don't use one large file, such as with DLLs, which is why you see so many of them. Typically all games use monolithic files unless the data is meant to be edited, or load times are quick enough without changing anything. –  May 22 '10 at 21:31