Lies, damned lies, and statistics
Lies, damned lies, and statistics
Mark Twain made us aware that there are “Lies, damned lies, and statistics” and recently we found a new twist on that belief. Our project’s usage statistics were way bigger than expected and sent our customer into a fevered frenzy. However, all was not as it seemed. An underlying storage issue in BigQuery meant we were getting figures three times greater than reality.
After careful investigation on our part combined with a prompt and professional response to the BigQuery team we solved the issue.
This blog post describes the problem in detail, presents the method we took in rooting it out as well as the lessons learnt. We believe that there are some useful insights for any project that collects and processes statistical data (i.e. nearly every serious project out there).
Quite recently we experienced a very interesting problem with statistics in one of our projects. We collect various statistics and we had a vague feeling that some of the stats were wrong. It was not entirely obvious, where the problems lay, it looked like all our processing pipeline and queries we’ve done are perfectly fine. But the statistics for total time spent in the application were at some point quite high - unusually high in fact, about 3x what we could expect. We also saw some discrepancies of our recorded valid purchases vs. the iTunes totals. But there was no obvious error.
The statistics were not extremely important for a while, our project was working fine, we were building the final version of the project. We captured all the raw data, so we knew that we could fix it later if needed. But after the last version was released, it became very urgent as our customer needed to know all the numbers and had to be certain of their validity. Since we had some doubts, we started to look closer and dig a bit deeper. And then we saw a sudden drop of the “time spent” in the app at 11th of October.
At the same time, neither daily active users from the same statistics nor average session length (both shown below) reported by Google Analytics showed anything unusual:
Now is the time to mention what technology stack we are using. It might not be the most demanding app in the world, but with continuous several hundreds requests per minute and many thousands of people using the app daily - we have pretty decent traffic and already tens of millions of rows of statistical data to process. We use Python + Django app on heroku as our backend server. We use a Postgres database. But due to the pretty high number of statistical data (hundreds of millions in pretty near future) we periodically move the statistical data to Google’s BigQuery. Otherwise our Postgres database starts to get a little clogged, and running any aggregation directly on the database could easily kill the on-line processing. So we use BigQuery to store and aggregate the data and we generate all the statistics directly in shared Google Spreadsheets. Our Google Spreadsheets use AppScript to refresh the statistical data and charts semi automatically - using BigQuery’s AppScript API. So we can then manually trigger a data refresh and at the same time we have pre-aggregated, statistics in spreadsheet, which we can analyse further. We found it to be a neat and flexible solution - which is much cheaper and more flexible, than using dedicated analytical/statistical engines (and enough for our needs at this moment of life of the product).
We also had similar discrepancies in the purchase reports - totals were pretty off for some time , but they were difficult to compare as iTunes does not provide useful aggregated reports. Surprisingly (?), it turned out that we also experienced sudden drop in purchase data around 11 Oct So we knew that we should look at beginning of October data.
First we thought the problem lay in the queries used to retrieve the data, but careful examination have shown that they are all ok.
Then we looked at some individual entries in BigQuery. Some late night head scratching and debugging sessions … and (gotcha!) we found that we had significant number of duplicated records in our BigQuery data. Since we had no historical data in Postgres database, we could not really find if history of user sessions were already duplicated in our database or not/
Initially we suspected our mobile client applications to send us wrong data, but we have not upgraded our app shortly before 11th October. And we could not find any problem there. We also noticed that our purchase data also contained some duplicates, but ONLY in BigQuery (we keep all purchases in DB of course - there are not 10s of millions of them yet ;) ). That meant the problem couldn’t be with the client.
Then we suspected our Django server which got a serious upgrade around the same date (quite a coincidence ugh). We looked through all differences in code
- the statistics code was not touched since end of July. So that was not it.
Then we looked at the scripts that move the data from our database to BigQuery. After very deep dive, we worked out a scenario (involving restart of machine in certain points of the move process and starting it on another machine). That could lead to some small amount of data duplication (such case should be extremely rare). But - not duplication of almost all data in some longish period of time. And … our logs did not show it at all. Not a single case. And we should have seen it, if this had been the case.
We looked frantically for some more help online - of course we tried to find help before, but the problem was too vague to find any useful information about similar cases.
The last thing we expected was that data was being duplicated by Google’s BigQuery, which is rock solid, fast and really good in crunching through vast amount of data. However after some hunting we found: http://stackoverflow.com/questions/12754484/seeing-duplicated-data-from-load-action-on-09-27-2012 It was a very, very, very rare problem that happened to one of BQ users with some data duplication while data coalescing. The one common thing between those problems was, that it was worked around/fixed at 11 Oct by BigQuery team. Coincidence? No. It turned out to be the very same problem.
The Google’s BigQuery team was extremely helpful and responsive - especially Jordan, the Tech Lead of the BigQuery backend. We got confirmation from them within hours of the request and over the next few days, even if Jordan was travelling between Europe and US, we got our data de-duplicated. Apparently, this case was extremely rare, and our case helped Jordan to write an automated script that could find and fix the duplicated data (I can only imagine, how difficult task you have when you go through Petabytes of data that you process). So, I believe they now have the tool to fix the problem in case it also happened to someone else.
Finally - this is how our statistics look like now. Pretty different picture, huh? Now our customer can happily use the data. It’s correct and consistent with other sources we have.
As with all the solved problems, there is always something you can learn and carry forward. In short, this is our learning from this issue:
- Never trust your stats. Question it. If you have slightest doubts - dig deeper.
- Have several alternative ways of collecting stats: Google Analytics, Flurry, Mixpanel, custom statistics: they might and should be run in parallel. Cross-check data from those different sources - and if you see differences: dig deeper.
- There are no “immune” parts of the system. Use the old Sherlock Holmes rule when investigating problems: “How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?”
Principal Software Engineer