
I have an app that sends data to Google Analytics. I am interested in accessing and storing this data on a Hadoop cluster. I am guessing this raw data will be in the form of logs. In particular, I would like to see the user_id, the searches made by the user and the search option that he/she decided to pay for on the app.

How can I do this? I am completely new to GA and I was not the one who set up GA for the app. I am just trying to see if there is a way through which I can access this raw data.

I should add that I cannot use BigQuery, since we do not have access to it, and the people who set up GA are not interested in upgrading to Universal Analytics.

Any help/thoughts/suggestions are appreciated.

feetwet
activelearner
  • If you still need access to raw, unsampled GA (non-Premium) data, see my answer below. I don't know why someone downvoted it, but it might be a way forward for you. – Michael Frost Billing Sep 13 '16 at 21:36

4 Answers


There is no way to get the logs, but...

The Google Analytics API will let you extract your data out of the system.

There are limits to what you can do:

  1. You are limited to 7 dimensions and 10 metrics per request.
  2. There is also a quota of 10k requests per day per profile (view).
  3. Some of the information you are talking about is not available unless the Google Analytics account is set up correctly.
  4. The data will still be aggregated in one way or another. The smallest time unit available in the API is the minute, so you will not be able to get raw data with exact timestamps, for example.
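
As a rough illustration of working within these limits, here is a minimal sketch that builds a Core Reporting API (v3) query URL and refuses requests that exceed the dimension and metric caps listed above. The view ID and the use of `ga:dimension1` as a user-ID custom dimension are assumptions made for the example, and the OAuth 2.0 access token the API requires is not shown.

```javascript
// Limits quoted in the list above.
const MAX_DIMENSIONS = 7;
const MAX_METRICS = 10;

// Core Reporting API v3 endpoint.
const ENDPOINT = 'https://www.googleapis.com/analytics/v3/data/ga';

// Build a query URL, failing early if the request exceeds the per-request limits.
function buildReportUrl({ viewId, startDate, endDate, metrics, dimensions }) {
  if (metrics.length === 0 || metrics.length > MAX_METRICS) {
    throw new RangeError('between 1 and ' + MAX_METRICS + ' metrics are allowed');
  }
  if (dimensions.length > MAX_DIMENSIONS) {
    throw new RangeError('at most ' + MAX_DIMENSIONS + ' dimensions are allowed');
  }
  const params = new URLSearchParams({
    ids: 'ga:' + viewId,
    'start-date': startDate,
    'end-date': endDate,
    metrics: metrics.join(','),
    dimensions: dimensions.join(','),
  });
  return ENDPOINT + '?' + params.toString();
}

const reportUrl = buildReportUrl({
  viewId: '12345678',            // hypothetical view (profile) ID
  startDate: '2014-12-01',
  endDate: '2014-12-01',
  metrics: ['ga:sessions'],
  // ga:dimension1 is assumed to be a custom dimension holding the user ID.
  dimensions: ['ga:dateHour', 'ga:minute', 'ga:searchKeyword', 'ga:dimension1'],
});
console.log(reportUrl);
```

Splitting by `ga:dateHour` and `ga:minute` is as fine-grained as the time axis gets, which is why the result is still aggregated rather than a hit log.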

It may be worth noting that a Google Analytics Premium customer can export the raw data from GA to BigQuery. Exporting data from BigQuery is free of charge, but storage and query processing are priced based on usage.

Premium analytics at a reasonable price for one flat annual fee of $150,000

Per Quested Aronsson
DaImTo

Since we're supposed to answer the original question: there is no way to get actual raw Google Analytics logs other than by duplicating the server call system.

In other words, you need to use a modified copy of the analytics.js script to point to a hosted webserver that can collect server calls.

Long story short, you want your site to capture hits to http://www.yourdatacollectionserver.com/collect?v=1&t=pageview[...] instead of http://www.google-analytics.com/collect?v=1&t=pageview[...]

This is easily deployed using a tag manager such as Google's GTM, along with normal Google Analytics tags.
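
As a minimal sketch of the duplication idea, and as an alternative to hosting a modified copy of analytics.js, the analytics.js tasks API lets you wrap `sendHitTask` so that every hit is also copied to your own endpoint. The function name `mirrorHits` and the injected `send` callback are hypothetical names chosen for this example; the collector URL is the placeholder from above.

```javascript
// Wrap a tracker's sendHitTask so each hit goes to Google Analytics as usual
// and is also forwarded to our own collection endpoint.
// `send` is injected so the sketch can run outside a browser.
function mirrorHits(tracker, collectorUrl, send) {
  const originalSendHitTask = tracker.get('sendHitTask');
  tracker.set('sendHitTask', function (model) {
    originalSendHitTask(model);                          // normal GA hit
    send(collectorUrl + '?' + model.get('hitPayload'));  // copy to our server
  });
}

// In a browser, after the tracker has been created:
// ga(function (tracker) {
//   mirrorHits(tracker, 'http://www.yourdatacollectionserver.com/collect',
//     function (url) { new Image().src = url; });
// });
```

The copied request carries the same query string GA receives, so the web server's access log ends up with one line per hit.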

That will effectively create log entries on your web server, which you can process using an ETL tool, Snowplow, Splunk, or your favorite Python/Perl/Ruby text-parsing engine.

It is then up to you to process the actual raw logs into something manageable. And before you ask, this is not retroactive.
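
To give a sense of what that processing might look like, here is a small sketch that extracts the hit parameters from one access-log line. It assumes the server writes Common Log Format and that hits arrive as GET requests to `/collect`; the sample line and the use of `cd1` for a user ID are made up for illustration.

```javascript
// Parse one access-log line and return the hit's timestamp and parameters,
// or null if the line is not a /collect request.
function parseCollectLine(line) {
  const request = /"(?:GET|POST) (\/collect\?\S*) HTTP\/[\d.]+"/.exec(line);
  if (!request) return null;
  const time = /\[([^\]]+)\]/.exec(line);
  // The base URL is only needed so the relative path can be parsed.
  const url = new URL(request[1], 'http://localhost');
  return {
    time: time ? time[1] : null,
    params: Object.fromEntries(url.searchParams),
  };
}

// Hypothetical log line; cd1 is assumed to carry the user ID.
const sampleLine =
  '203.0.113.7 - - [13/Sep/2016:21:36:00 +0000] ' +
  '"GET /collect?v=1&t=pageview&cid=555.123&dp=%2Fsearch&cd1=user42 HTTP/1.1" 200 35';

const hit = parseCollectLine(sampleLine);
console.log(hit.time, hit.params);
```

Each parsed hit is a flat record, which is a convenient shape to write out as JSON lines before loading into Hadoop.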

Julien Coquet
  • Besides building your own ETL, you can use something like Google Analytics Parallel Tracking from Reflective Data that has a session processor (very similar to the one in GA) and data enrichment system built-in. https://reflectivedata.com/analytics-data-pipeline/ – Silver Ringvee May 25 '20 at 10:18

To get GA data click by click, you can structure your queries so that their results can be joined together.

First you need to prepare the data in GA. With each hit you send, add a hashed value, or the clientId plus a timestamp, to a custom dimension. This gives every hit a key on which the results of separate queries can be joined.

For example, this is how we do it at Scitylana. The script below hooks into GA's tracking script and makes sure each hit contains a key for later stitching of query results:

<script>
var BindingsDimensionIndex = 1; // placeholder: replace 1 with your CUSTOM DIMENSION INDEX
var Version = 1;

function overrideBuildTask() {
    var c = window[window['GoogleAnalyticsObject'] || 'ga'];
    var d = c.getAll();
    if (console) { console.log('Found ' + d.length + ' ga trackers') }
    for (var i = 0; i < d.length; i++) {
        var e = d[i]; var f = e.get('name');
        if (console) { console.log(f + ' modified') }
        var g = e.get('buildHitTask');
        if (!e.buildHitTaskIsModified) {
            e.set('buildHitTask', function(a) {
                window['_sc_order'] = typeof window['_sc_order'] == 'undefined' ? 0 : window['_sc_order'] + 1;
                var b = ['sl=' + Version, 'u=' + e.get('clientId'), 't=' + (new Date().getTime() + window['_sc_order'])].join('&');
                a.set('dimension' + BindingsDimensionIndex, b);
                g(a);
                if (console) {
                    console.log(f + '.' + a.get('hitType') + '.set.customDimension' + BindingsDimensionIndex + ' = ' + b)
                }
            });
            e.buildHitTaskIsModified = true
        }
    }
}
window.ga = window.ga || function() {
    (ga.q = ga.q || []).push(arguments);
    if (arguments[0] === 'create') { ga(overrideBuildTask) }
};
ga.l = +new Date();

</script>

Of course, you then need to write a script that joins all the results you have pulled out of GA.
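
A minimal sketch of such a join follows. It assumes each query result has already been converted into an array of row objects, and that the key custom dimension is `ga:dimension1`; both are assumptions for the example. The key format (`sl=...&u=...&t=...`) is the one the script above writes.

```javascript
// Split the stitching key written by the tracking script into its parts.
function parseKey(key) {
  const parts = new URLSearchParams(key);
  return {
    version: Number(parts.get('sl')),
    clientId: parts.get('u'),
    time: Number(parts.get('t')),
  };
}

// Merge rows from several query results that share the same key,
// then order the merged hits by their timestamp.
function joinResults(resultSets, keyField) {
  const hits = new Map();
  for (const rows of resultSets) {
    for (const row of rows) {
      const key = row[keyField];
      hits.set(key, Object.assign(hits.get(key) || parseKey(key), row));
    }
  }
  return Array.from(hits.values()).sort((a, b) => a.time - b.time);
}

// Two hypothetical query results, each limited to a few dimensions.
const pageRows = [
  { 'ga:dimension1': 'sl=1&u=555.123&t=1473802560001', 'ga:pagePath': '/search' },
  { 'ga:dimension1': 'sl=1&u=555.123&t=1473802560000', 'ga:pagePath': '/home' },
];
const deviceRows = [
  { 'ga:dimension1': 'sl=1&u=555.123&t=1473802560000', 'ga:deviceCategory': 'mobile' },
  { 'ga:dimension1': 'sl=1&u=555.123&t=1473802560001', 'ga:deviceCategory': 'mobile' },
];

const joined = joinResults([pageRows, deviceRows], 'ga:dimension1');
console.log(joined);
```

Because every query includes the key dimension, the 7-dimension limit per request stops being a limit on how many attributes a reconstructed hit can carry.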

Flexo
  • UPDATE: Scitylana now extracts "raw" (or, more precisely, unaggregated) data from the Google Analytics API without ANY plugins. We use only the V4 Reporting API to create an unaggregated dataset going back in time. This data is great for data integration, aggregation and reporting on any platform you like. Data is delivered to BigQuery, S3 or Azure Blob Storage. – Michael Frost Billing Jan 05 '20 at 11:47

You can get aggregated data, i.e. the data you can see in your Google Analytics account, using the Google Analytics API. To get raw data, you need to be a Premium user (costs ~150k per year). Premium users can export into Google BigQuery and from there to wherever they want.

  • And how does this answer differ from mine? – DaImTo Dec 03 '14 at 10:47
  • Even after you edited your post to copy some of the information from my post (e.g. the 150k; see the edit log of your answer), it differs, for example in the following way: a reader who is not familiar with the details of Google Analytics does not know what a "log" is, i.e. whether it refers to raw event data or to some kind of processed (e.g. filtered) data. This is clear from my answer, but not from yours. – Johannes Schneider Dec 04 '14 at 14:33
  • Using Premium/360 does not give you access to raw data, merely a data dump from BigQuery, which is already processed. – Julien Coquet Sep 15 '16 at 11:56
  • I really don't understand this discussion, since @activelearner doesn't have the resources for GA Premium. Why do you present it as an answer? When I try to present alternatives and even a working solution, I get downvoted. In contrast to everyone else here, I actually present a solution, and it is just as free as GA. Everybody else presents commercial solutions with a giant price tag or states that it's not possible to get finer-grained data. – Michael Frost Billing Sep 16 '16 at 12:16
  • There are tools like this https://reflectivedata.com/analytics-data-pipeline/ (known as parallel tracking) that can send raw hit-level data from the site straight into BigQuery - no need for Google Analytics 360. – Silver Ringvee Aug 18 '20 at 09:45