share

BUSINESS

6min read

Lies, damned lies, and statistics

Abstract

Mark Twain made us aware that there are “Lies, damned lies, and statistics” and recently we found a new twist on that belief. Our project’s usage statistics were way bigger than expected and sent our customer into a fevered frenzy. However, all was not as it seemed. An underlying storage issue in BigQuery meant we were getting figures three times greater than reality.

After careful investigation on our part combined with a prompt and professional response to the BigQuery team we solved the issue.

This blog post describes the problem in detail, presents the method we took in rooting it out as well as the lessons learnt. We believe that there are some useful insights for any project that collects and processes statistical data (i.e. nearly every serious project out there).

Problem

Quite recently we experienced a very interesting problem with statistics in one of our projects. We collect various statistics and we had a vague feeling that some of the stats were wrong. It was not entirely obvious, where the problems lay, it looked like all our processing pipeline and queries we’ve done are perfectly fine. But the statistics for total time spent in the application were at some point quite high - unusually high in fact, about 3x what we could expect. We also saw some discrepancies of our recorded valid purchases vs. the iTunes totals. But there was no obvious error.

The statistics were not extremely important for a while, our project was working fine, we were building the final version of the project. We captured all the raw data, so we knew that we could fix it later if needed. But after the last version was released, it became very urgent as our customer needed to know all the numbers and had to be certain of their validity. Since we had some doubts, we started to look closer and dig a bit deeper. And then we saw a sudden drop of the “time spent” in the app at 11th of October.

stats1.png

At the same time, neither daily active users from the same statistics nor average session length (both shown below) reported by Google Analytics showed anything unusual:

stats2.png

Technology

Now is the time to mention what technology stack we are using. It might not be the most demanding app in the world, but with continuous several hundreds requests per minute and many thousands of people using the app daily - we have pretty decent traffic and already tens of millions of rows of statistical data to process. We use Python + Django app on heroku as our backend server. We use a Postgres database. But due to the pretty high number of statistical data (hundreds of millions in pretty near future) we periodically move the statistical data to Google’s BigQuery. Otherwise our Postgres database starts to get a little clogged, and running any aggregation directly on the database could easily kill the on-line processing. So we use BigQuery to store and aggregate the data and we generate all the statistics directly in shared Google Spreadsheets. Our Google Spreadsheets use AppScript to refresh the statistical data and charts semi automatically - using BigQuery’s AppScript API. So we can then manually trigger a data refresh and at the same time we have pre-aggregated, statistics in spreadsheet, which we can analyse further. We found it to be a neat and flexible solution - which is much cheaper and more flexible, than using dedicated analytical/statistical engines (and enough for our needs at this moment of life of the product).

Investigation

We also had similar discrepancies in the purchase reports - totals were pretty off for some time , but they were difficult to compare as iTunes does not provide useful aggregated reports. Surprisingly (?), it turned out that we also experienced sudden drop in purchase data around 11 Oct So we knew that we should look at beginning of October data.

First we thought the problem lay in the queries used to retrieve the data, but careful examination have shown that they are all ok.

Then we looked at some individual entries in BigQuery. Some late night head scratching and debugging sessions … and (gotcha!) we found that we had significant number of duplicated records in our BigQuery data. Since we had no historical data in Postgres database, we could not really find if history of user sessions were already duplicated in our database or not/

Initially we suspected our mobile client applications to send us wrong data, but we have not upgraded our app shortly before 11th October. And we could not find any problem there. We also noticed that our purchase data also contained some duplicates, but ONLY in BigQuery (we keep all purchases in DB of course - there are not 10s of millions of them yet ;) ). That meant the problem couldn’t be with the client.

Then we suspected our Django server which got a serious upgrade around the same date (quite a coincidence ugh). We looked through all differences in code

  • the statistics code was not touched since end of July. So that was not it.

Then we looked at the scripts that move the data from our database to BigQuery. After very deep dive, we worked out a scenario (involving restart of machine in certain points of the move process and starting it on another machine). That could lead to some small amount of data duplication (such case should be extremely rare). But - not duplication of almost all data in some longish period of time. And … our logs did not show it at all. Not a single case. And we should have seen it, if this had been the case.

Root cause

We looked frantically for some more help online - of course we tried to find help before, but the problem was too vague to find any useful information about similar cases.

The last thing we expected was that data was being duplicated by Google’s BigQuery, which is rock solid, fast and really good in crunching through vast amount of data. However after some hunting we found: http://stackoverflow.com/questions/12754484/seeing-duplicated-data-from-load-action-on-09-27-2012 It was a very, very, very rare problem that happened to one of BQ users with some data duplication while data coalescing. The one common thing between those problems was, that it was worked around/fixed at 11 Oct by BigQuery team. Coincidence? No. It turned out to be the very same problem.

Solution

The Google’s BigQuery team was extremely helpful and responsive - especially Jordan, the Tech Lead of the BigQuery backend. We got confirmation from them within hours of the request and over the next few days, even if Jordan was travelling between Europe and US, we got our data de-duplicated. Apparently, this case was extremely rare, and our case helped Jordan to write an automated script that could find and fix the duplicated data (I can only imagine, how difficult task you have when you go through Petabytes of data that you process). So, I believe they now have the tool to fix the problem in case it also happened to someone else.

Result

Finally - this is how our statistics look like now. Pretty different picture, huh? Now our customer can happily use the data. It’s correct and consistent with other sources we have.

stats3.png

Lessons learnt

As with all the solved problems, there is always something you can learn and carry forward. In short, this is our learning from this issue:

  • Never trust your stats. Question it. If you have slightest doubts - dig deeper.
  • Have several alternative ways of collecting stats: Google Analytics, Flurry, Mixpanel, custom statistics: they might and should be run in parallel. Cross-check data from those different sources - and if you see differences: dig deeper.
  • There are no “immune” parts of the system. Use the old Sherlock Holmes rule when investigating problems: “How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?”

share


JarekPrincipal Software Engineer

LEARN MORE

Contact us if you have any questions regarding the article or just want to chat about technology, our services, job offers and more!

POLIDEA NEWSLETTER

Sign in and expect sharp insights, recommendations, ebooks and fascinating project stories delivered to your inbox

The controller of the personal data that you are about to provide in the above form will be Polidea sp. z o.o. with its registered office in Warsaw at ul. Przeskok 2, 00-032 Warsaw, KRS number: 0000330954, tel.: 0048 795 536 436, email: hello@polidea.com (“Polidea”). We will process your personal data based on our legitimate interest and/or your consent. Providing your personal data is not obligatory, but necessary for Polidea to respond to you in relation to your question and/or request. If you gave us consent to call you on the telephone, you may revoke the consent at any time by contacting Polidea via telephone or email. You can find detailed information about the processing of your personal data in relation to the above contact form, including your rights relating to the processing, HERE.

Data controller:

The controller of your personal data is Polidea sp. z o.o. with its registered office in Warsaw at ul. Przeskok 2, 00-032 Warsaw, KRS number: 0000330954, tel.: [0048795536436], email: [hello@polidea.com] (“Polidea”)

Purpose and legal bases for processing:

 

Used abbreviations:

GDPR – Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016
on the protection of natural persons with regard to the processing of personal data and on the free movement
of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)

ARES – Polish Act on Rendering Electronic Services dated 18 July 2002

TL – Polish Telecommunications Law dated 16 July 2004

1)        sending to the given email address a newsletter including information on Polidea’s new projects, products, services, organised events and/or general insights from the mobile app business world |art. 6.1 a) GDPR, art. 10.2 ARES and art. 172.1 TL (upon your consent)

Personal data:name, email address

2)       statistical, analytical and reporting purposes |art. 6. 1 f) GDPR (based on legitimate interests pursued by Polidea, consisting in analysing the way our services are used and adjusting them to our clients’ needs, as well as developing new services)

Personal data:name, email address

Withdrawal of consent:

You may withdraw your consent to process your personal data at any time.

Withdrawal of the consent is possible solely in the scope of processing performed based on the consent. Polidea is authorised to process your personal data after you withdraw your consent if it has another legal basis for the processing, for the purposes covered by that legal basis.

Categories of recipients:

Your personal data may be shared with:

1)       authorised employees and/or contractors of Polidea

2)       persons or entities providing particular services to Polidea (accounting, legal, IT, marketing and advertising services) – in the scope required for those persons or entities to provide those services to Polidea

 

Retention period:

1)       For the purpose of sending newsletter to the given email address – for as long as the relevant consent is not withdrawn

2)       For statistical, analytical and reporting purposes – for as long as the relevant consent is not withdrawn

Your rights:

 

Used abbreviation:

GDPR – Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016
on the protection of natural persons with regard to the processing of personal data and on the free movement
of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)

According to GDPR, you have the following rights relating to the processing of your personal data, exercised by contacting Polidea via [e-mail, phone].

1)       to access to your personal data (art. 15 GDPR) by requesting sharing and/or sending a copy of all your personal data processed by Polidea

2)       to request rectification of inaccurate personal data
(art. 16 GDPR) by indicating the data requiring rectification

3)       to request erasure of your persona data (art. 17 GDPR); Polidea has the rights to refuse erasing the personal data in specific circumstances provided by law

4)       to request restriction of processing of your personal data (art. 18 GDPR) by indicating the data which should be restricted

5)       to move your personal data (art. 20 GDPR) by requesting preparation and transfer by Polidea of the personal data that you provided to Polidea to you or another controller in a structured, commonly used machine-readable format

6)       to object to processing your personal data conducted based on art. 6.1 e) or f) GDPR, on grounds relating to your particular situation (art. 21 GDPR)

7)       to lodge a complaint with a supervisory authority,
in particular in the EU member state of your habitual residence, place of work or place of the alleged infringement if you consider that the processing
of personal data relating to you infringes the GDPR
(art. 77.1 GDPR)

No obligation to provide data:

Providing your personal data is not obligatory, but necessary for Polidea to provide you the newsletter service

Refusal to provide the above data will result in inability to receive the newsletter service.

Profiling

In the process of providing the newsletter service, we make decisions in an automated way, including profiling, based on the data you provide.

 

“Profiling” means automated processing of personal data consisting of the use of your personal data to evaluate certain personal aspects relating to you, in particular to analyze or predict aspects concerning your personal preferences and interests.

 

The automated decisions are taken based on the analysis of clicked and viewed content. They affect the targeting of specific newsletter content to selected users registered to receive the newsletter service, based on the anticipated interests of the recipient.