2

I downloaded freebase-rdf-latest from freebase.com. I uncompressed it and now have a 380.7 GB file. How can I read that data? Which program would you recommend? Thanks for your help!

mariana
  • 39
  • 2
  • Product recommendations are off topic here. –  Feb 02 '15 at 19:18
  • @SabreTooth mariana isn't asking for a "product recommendation"; they're asking for the best way to accomplish their objective. Why is that an issue for you? – Tom Morris Feb 03 '15 at 03:11

2 Answers

3

I'll disagree with @Nandana and say that you definitely should not load it into a triple store for most uses. There's a ton of redundancy in it and, even without the redundancy, usually you're only interested in a small portion of it.

Also, for most applications, you probably want to leave the file compressed. You can probably decompress it quicker than you can read the uncompressed version from the file system. If you need to split it for processing in a MapReduce environment, the file is (or at least used to be) a series of concatenated compressed files which can be split apart without having to decompress them.

Nandana has a good suggestion about considering derivative data products. The tradeoff to consider is how often they are updated and how transparent their filtering/extraction pipeline is.

For simple tasks, you can get pretty far with the very latest data using zgrep, cut, and associated Unix command line tools.
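If you'd rather script it than chain shell tools, here is a rough Python sketch of the same approach: stream the compressed file and keep only the triples you care about. It assumes the dump is gzip-compressed with one tab-separated triple per line (subject, predicate, object, trailing dot) and that predicates live under the http://rdf.freebase.com/ns/ namespace. The helper names iter_triples and filter_by_predicate are my own, not part of any Freebase tooling.

```python
import gzip
from typing import Iterator, Tuple

# Assumed namespace prefix for Freebase predicates in the RDF dump.
NS = "http://rdf.freebase.com/ns/"


def iter_triples(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (subject, predicate, object) from a gzipped, tab-separated dump.

    The file is decompressed on the fly, so the full uncompressed data
    never has to be written to disk. Python's gzip module also reads
    files made of several concatenated gzip members.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue  # skip blank or malformed lines
            yield parts[0], parts[1], parts[2]


def filter_by_predicate(path: str, predicate: str) -> Iterator[Tuple[str, str]]:
    """Yield (subject, object) pairs for one predicate, e.g. 'type.object.name'."""
    wanted = f"<{NS}{predicate}>"
    for s, p, o in iter_triples(path):
        if p == wanted:
            yield s, o
```

Something like `for s, o in filter_by_predicate("freebase-rdf-latest.gz", "type.object.name"): print(s, o)` would then print every name triple while reading the file once, front to back. For one-off extractions zgrep will likely be faster; the script is mainly useful once the filtering logic gets more involved than a single pattern.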

Tom Morris
  • 10,055
  • 28
  • 48
  • Would the downvoter care to add a comment as to why the answer wasn't considered helpful or on-topic? – Tom Morris Feb 04 '15 at 05:37
  • 1
    It was really helpful indeed. Thank you all. I need 15 reputation to vote up the answers. – mariana Feb 04 '15 at 14:25
  • "You can probably decompress it quicker than you can read the uncompressed version from the file system." - key point, thanks! – a darren Jan 02 '16 at 17:00
2

You have to load the data into a triple store such as Virtuoso. You can take a look at how to load the data in the following references.

However, you might be interested in other projects that provide a cleaned version of freebase pre-loaded into a triple store.

SindiceTech Freebase distribution

Freebase data is available for full download but as today, using it "as a whole" is all but simple. The SindiceTech Freebase distribution solves that by providing all the Freebase knowledge preloaded in an RDF specific database (also called triplestore) and equipped with a set of tools that make it much easier to compose queries and understand the data as a whole.

:BaseKB

:BaseKB is an RDF knowledge base derived from Freebase, a major source of the Google Knowledge Graph; :BaseKB contains about half as many facts as the Freebase dump because it removes trivial, ill-formed and repetitive facts that make processing difficult. The most recent version of :BaseKB Gold can be downloaded via BitTorrent, or, if you wish to run SPARQL queries against it, you can run it in the AWS cloud, pre-loaded into OpenLink Virtuoso 7.

Nandana
  • 1,190
  • 7
  • 17