
In our project we need to create and maintain a collection of ancient manuscripts (which were scanned and converted to text using OCR software). There are ca. 1000 manuscripts. Some of them were copied by hand and passed down through generations, so different versions of them appeared over time. The differences within any one version are usually small, but the number of versions of a single manuscript can be significant, about 5-7 on average. Manuscripts are grouped into Groups based on their content and other factors. Our project serves as a sort of "middleware", or pure data supply, for other projects which might present the information in more user-friendly ways, such as a desktop GUI, a website or mobile apps. Our infrastructure should enable collaboration (error corrections, etc.) for those daughter projects and for individuals, something like a wiki.

The initial idea was to keep the manuscripts as plain text files (in org-mode, for lightweight markup and some metadata), with Groups represented by directories, like this:

Project/
├── Group1
│   ├── Group3
│   ├── manuscript_A
│   └── manuscript_B
└── Group2
    └── manuscript_C

Different versions of a manuscript would be kept in separate permanent (i.e. never merged) git branches, for example a branch named manuscript_B-Athens_728.

Questions:

  1. The problem with this approach is that if one uploads such a git repository to e.g. GitLab, all the branches of ALL manuscripts are displayed at once, which makes this versioning system unusable. Is there a way to group branches hierarchically, or to somehow "attach" a set of branches to one file (manuscript)?

  2. Is it possible for a reader who is in the middle of a file to get an indication that another version exists at that particular place in the text, and that it can be found in such-and-such branch?

  3. How well can git cope with the case where everything is in Unicode: (a) manuscript content, (b) project, directory and file names, (c) branch names?

  4. Are there better approaches to organizing such a collection (in git)? I was thinking about creating a separate git repository per manuscript, like this:

Project/
├── Group1
│   ├── Group3
│   ├── Manuscript_A
│   │   └── manuscript_A
│   └── Manuscript_B
│       └── manuscript_B
└── Group2
    └── Manuscript_C
        └── manuscript_C

but this seems more difficult to maintain, and it adds an unnecessary level of hierarchy (the Manuscript_A-type directories). Or is it possible to have several git repos in one directory, each tracking its own specific file?

user1876484
  • "_separate git repository per manuscript_" you're on the right track... but I don't think Git is the tool you're looking for... – Attie Apr 10 '18 at 11:50
  • It is highly doubtful that git or any other version control system will be a good match for your needs. You should instead look at a system where you can upload "conceptual things" like a book or a manuscript and keep multiple versions of the same thing readily available, with annotations and metadata about why you kept each version. "version" in the concept of a version control system isn't meant in the context of a "copy", it is meant as a version of a timeline, the project evolved. If you have this type of version it might fit, but if it is different translations, for example, perhaps not. – Lasse V. Karlsen Apr 10 '18 at 12:06
  • @LasseVågsætherKarlsen We are actually going to use meta-data (utilizing org-mode). One of the reasons to rely on git for this is to enable decentralized (also offline) collaboration. – user1876484 Apr 10 '18 at 12:32

1 Answer


Not every concept of "tracking different versions of X" is the same, and it doesn't sound like your project's concept of "tracking different versions of a manuscript" is anything close enough to the standard model for "tracking different versions of a program's source code" to make git the right tool.

A software version control system is about tracking the evolution of files over time, particularly when that evolution needs to be coordinated across files. None of that seems to apply here. So most of what git can do, you're "working around".

To answer your questions:

1) Yes. You can "namespace" branches

manuscriptA/version1
manuscriptA/version2
manuscriptC/version10
...

but it would be up to your tooling to make use of those namespaces. Or you can just use separate repos.
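As a rough idea of what "tooling that makes use of those namespaces" could mean, here is a minimal Python sketch that groups branch names by the part before the first slash. The function name `group_branches` and the sample list are made up for illustration; in practice the names would come from git itself (for example, `git branch --list 'manuscriptA/*'` already filters branches by a pattern).

```python
from collections import defaultdict


def group_branches(branch_names):
    """Group 'manuscript/version' branch names by their manuscript prefix.

    Hypothetical helper: branches without a slash are collected under "".
    """
    groups = defaultdict(list)
    for name in branch_names:
        # Split on the first slash only, so deeper nesting stays in the version part.
        manuscript, sep, version = name.partition("/")
        if sep:
            groups[manuscript].append(version)
        else:
            groups[""].append(name)
    return {prefix: sorted(versions) for prefix, versions in groups.items()}


# Sample data using the naming scheme shown above (plus one un-namespaced branch).
branches = [
    "manuscriptA/version1",
    "manuscriptA/version2",
    "manuscriptC/version10",
    "master",
]
grouped = group_branches(branches)
for prefix, versions in grouped.items():
    print(prefix or "(no namespace)", versions)
```

One thing to keep in mind with this naming scheme: git will not let a branch called `manuscriptA` coexist with `manuscriptA/version1`, so the prefix has to be used purely as a grouping label.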

2) No. You would need to write significant external tooling to support this requirement. git can tell you where a file last changed within a branch history, but it can't generally display the version on one branch with annotations wherever another branch differs.

The closest concept in git to supporting this need would be to merge the transcripts, preserving conflict markers everywhere the versions differ. Of course, git conflict markers are far from the most intuitive way to represent this. And once you boil the manuscript down to a single conflicted file, you've removed the last vestige of "storing multiple versions of a file" from the picture, so git (or any software version control system) makes even less sense as a solution.
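To give a sense of the external tooling point 2 would need, here is a minimal sketch that compares two versions of a text word by word and reports where they differ. It uses Python's standard `difflib` and knows nothing about git; the function name `variant_readings` and the sample sentences are assumptions for illustration, and a real tool would first have to read each version's text from its branch.

```python
import difflib


def variant_readings(base_text, other_text):
    """List the places where other_text differs from base_text.

    Returns tuples of (word index in base, base words, other words).
    Hypothetical helper, comparing on whitespace-separated words.
    """
    base = base_text.split()
    other = other_text.split()
    matcher = difflib.SequenceMatcher(a=base, b=other, autojunk=False)
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on one side means an insertion or a deletion.
            variants.append((i1, " ".join(base[i1:i2]), " ".join(other[j1:j2])))
    return variants


# Made-up sample texts standing in for two versions of one manuscript.
base_version = "in the beginning was the word"
other_version = "in the beginning was a word"
result = variant_readings(base_version, other_version)
for index, base_words, other_words in result:
    print(f"word {index}: {base_words!r} vs {other_words!r}")
```

A reader-facing view could then mark each reported position in the base text and name the branch holding the other reading, but that display layer is entirely up to your own tooling.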

3) I think Unicode is the least of your worries.

4) Almost certainly, but as I don't work in this field I don't know what they would be.

Mark Adelsberger
  • As for #4 - I meant in git. Thank you for your answer! – user1876484 Apr 10 '18 at 16:21
  • 1. [gitnamespaces for hiding branches](https://stackoverflow.com/questions/46854505/how-to-use-git-namespace-to-hide-branches) was exactly what I was looking for! 2. is nice to have, not a must! 3. great! 4. with namespaces I'm quite happy now! – user1876484 Apr 10 '18 at 21:04