Data Analysis Methods Guide
Information is often called the most valuable commodity on the market. Not long ago, we could only imagine a world in which such large volumes of data are exchanged so quickly. With this much data, we need a way to store and organize it efficiently. Thanks to data analysis, we can learn new things about our business and the world.
Let’s assume you have a great idea for a new product for your company: a special smartphone that boosts its users’ productivity. As the owner of a prosperous IoT company, you have access to a lot of information on how your customers use your products. You also have access to current statistics on productivity and motivation across various age groups. Additionally, your company has a UX team that can invite users to take a survey and collect relevant data for this project.
All the collected raw data usually ends up scattered across multiple emails and our coworkers’ shared files. This can seem overwhelming, especially as we probably need only some specific information, not all the data we have.
First, we need to compare some numbers or create aggregations, for example to find out the average age of people who might have problems with productivity. To do that, we have to decide what information we want to retrieve from the collected data and determine what is valuable to us.
Before the analysis itself, we need to reject all the data that:
- might be fraudulent (results can be doubtful, for example, when a survey was not really taken because bots were accessing your services),
- doesn’t match our criteria,
- was not collected through the correct procedure (meaning that a respondent didn’t follow the survey procedure properly),
- is incomplete.
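The rejection rules above can be sketched as a simple filter. This is only a minimal illustration: the `SurveyResponse` fields, the `is_usable` helper, and the 18 to 65 age range are all hypothetical and stand in for whatever your own survey and criteria look like.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SurveyResponse:
    # All field names are hypothetical, chosen only for this illustration.
    respondent_id: str
    age: Optional[int]               # None means the answer is missing
    hours_focused: Optional[float]   # None means the answer is missing
    is_bot: bool                     # flagged by some upstream bot detection
    followed_procedure: bool         # respondent completed the survey as instructed


def is_usable(r: SurveyResponse, min_age: int = 18, max_age: int = 65) -> bool:
    """Return True only if the response passes all four rejection rules."""
    if r.is_bot:                                  # possibly fraudulent
        return False
    if not r.followed_procedure:                  # collected incorrectly
        return False
    if r.age is None or r.hours_focused is None:  # incomplete
        return False
    return min_age <= r.age <= max_age            # matches our (assumed) criteria


responses = [
    SurveyResponse("a1", 29, 5.5, False, True),
    SurveyResponse("b2", 34, 6.0, True, True),     # bot traffic
    SurveyResponse("c3", None, 4.0, False, True),  # missing age
    SurveyResponse("d4", 71, 3.0, False, True),    # outside the target age range
    SurveyResponse("e5", 42, 7.0, False, False),   # procedure not followed
    SurveyResponse("f6", 51, 2.5, False, True),
]

usable = [r for r in responses if is_usable(r)]
print([r.respondent_id for r in usable])  # ['a1', 'f6']
```

Keeping each rule as a separate check makes it easy to count how many responses each rule rejects, which is often worth reporting alongside the results.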
After validating the data, we have to deal with errors, for example a comma used instead of a dot in floating-point numbers, or empty values. The latter can be dealt with in various ways:
- Remove the items that contain empty values.
- Replace empty numerical values with the mean or median.
- Replace them with the last value observed in the given category (Last Observation Carried Forward).
- Try to obtain these values again.
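As a rough sketch of the first three options, here is how the cleanup might look with pandas. The `group` and `hours_focused` columns and all values are invented for the example, and which strategy is appropriate depends on your data.

```python
import pandas as pd

# Hypothetical raw survey export; column names and values are made up.
raw = pd.DataFrame({
    "group": ["student", "student", "student", "employee", "employee"],
    "hours_focused": ["5,5", "6.0", "", "4,25", None],
})

# Fix the decimal separator, then convert; blanks and unparsable text become NaN.
hours = raw["hours_focused"].str.strip().str.replace(",", ".", regex=False)
clean = raw.assign(hours_focused=pd.to_numeric(hours, errors="coerce"))

# Option 1: remove the items that contain empty values.
dropped = clean.dropna(subset=["hours_focused"])

# Option 2: replace empty values with the median (use .mean() for the mean).
median_filled = clean.assign(
    hours_focused=clean["hours_focused"].fillna(clean["hours_focused"].median())
)

# Option 3: Last Observation Carried Forward within each category.
locf = clean.assign(hours_focused=clean.groupby("group")["hours_focused"].ffill())

print(locf)
```

Note that carrying the last observation forward only makes sense when the rows are in a meaningful order, for example sorted by time within each category.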
We can distinguish two types of data analysis methods: quantitative and qualitative.
Thanks to these methods, we are able to find the most interesting relations and aggregations in the data for our project. For example, we may want to determine the average age of users of each sex, which would be useful when tailoring advertising to our target group.
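An aggregation like the average age per sex could be computed with pandas roughly as follows. The `users` table, its column names, and its values are assumptions made up for this sketch.

```python
import pandas as pd

# Hypothetical user records; the column names and values are invented.
users = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F"],
    "age": [24, 31, 36, 45, 30],
})

# Average age per sex, plus group sizes so that small groups are easy to spot.
summary = users.groupby("sex")["age"].agg(["mean", "count"])
print(summary)
```

Reporting the group size next to the average helps avoid reading too much into an average based on only a handful of users.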
With quantitative methods, we use descriptive analysis to reveal relations in the data. We can use one of the following basic measures:
- mean: e.g. what’s the numerical average of the time that our customers spend on meditation,
- median: e.g. the midpoint of the ages of our customers,
- range: e.g. the highest and lowest time spent on our app across all sessions,
- percentage: e.g. what share of students use our app,
- frequency: e.g. how often users open our app,
- mode: e.g. what’s the most common symptom of lack of productivity among our users,
but also more sophisticated methods such as:
- correlation: e.g. between age and productivity issues,
- regression: e.g. predicting how stress levels can change for our users,
- analysis of variance: e.g. differences between motivation levels among our age groups.
Thanks to these methods, we can understand and relate the numbers in the collected data. Nevertheless, statistics should be used carefully because they can sometimes lead to wrong conclusions.
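One common way statistics can mislead is a single outlier distorting the mean. The session lengths below are invented to illustrate the effect.

```python
import statistics

# Hypothetical session lengths in minutes; one user left the app open overnight.
sessions = [10, 12, 11, 13, 12, 480]

mean_minutes = statistics.mean(sessions)      # pulled up by the single outlier
median_minutes = statistics.median(sessions)  # barely affected by it

print(f"mean = {mean_minutes:.1f}, median = {median_minutes}")
# mean = 89.7, median = 12.0
```

Reporting only the mean here would suggest that a typical session lasts about an hour and a half, while five of the six sessions took less than a quarter of an hour.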
When it comes to analyzing non-numerical data, we can use qualitative methods. Thanks to them, we can analyze the content (for example, the exact words used in the responses) or the context of the answers (such as emotions or the respondent’s everyday habits). Machine learning tools such as Google Cloud Natural Language can help with qualitative methods. But there are also cases where we can only count on human judgment, for example when it comes to identifying irony or sarcasm.
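The simplest form of content analysis, counting which words appear in the responses, can be sketched with the standard library alone. The answers and the stop-word list below are made up, and plain counting captures neither context nor sarcasm.

```python
import re
from collections import Counter

# Hypothetical free-text survey answers.
answers = [
    "I get distracted by notifications all the time.",
    "Too many notifications and too many meetings.",
    "Meetings break my focus.",
]
# A tiny, ad hoc stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"i", "by", "all", "the", "and", "my", "too", "many", "get"}

words = []
for answer in answers:
    # Lowercase the text and keep only alphabetic tokens.
    for word in re.findall(r"[a-z']+", answer.lower()):
        if word not in STOP_WORDS:
            words.append(word)

top_words = Counter(words).most_common(2)
print(top_words)
```

In this sample, "notifications" and "meetings" each appear twice, which hints at the themes worth reading about in more detail.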
How do we put all this into practice? There is always good old paper or an Excel file that you can use to store, check and work on your data. Once the dataset grows, you might need more advanced tools, or you might write your own code that processes the data on your machine or in the cloud.
Many languages make it possible to analyze data thanks to a wide selection of libraries and built-in features. The most common are R and Python. Which one should you choose, and why are these two the most popular?
R was created for statistical computing and preparing visualizations. It is often considered easier to learn for people who don’t have a programming background or aren’t really interested in acquiring one. Python, in comparison, is an easy-to-learn general-purpose language with various data analysis tools.
Programmers who would like to jump into the data world can start with pandas, NumPy or Matplotlib. Usually, it’s best to use the language that your team has the most experience in. However, if your team is multidisciplinary with broad competences, you can process data with tools such as Apache Beam and write code in Java, Python, or Go.
There are also tools and frameworks worth checking out that can help you with data analysis and visualizations:
- Tableau. A very popular tool that makes it easy to create graphs and diagrams that are understandable for everyone and powered by real-time data.
- If you are looking for an open-source alternative to Tableau, there is the Apache Superset project. The tool is still in incubation, but we believe it’s worth checking out, or even improving by contributing or sending suggestions.
- For data preparation, you can use Talend, which reads and transforms data while helping to ensure good data quality. It includes integrations with various data stores and processing tools.
- When it comes to data storage, an interesting option is Snowflake, which provides many integrations with other sources. Snowflake offers a fast, multi-region service where you can store your data in various formats (relational tables, JSON, Parquet, etc.).
- Apache Beam is an open-source project that allows you to read data from sources of your choice, transform it on various processing engines (such as Flink, Spark, or Dataflow), and save it to an output of your choice.
- ROOT is a data analysis framework well suited to scientific purposes. It offers many useful features such as visualization, data analysis, and GUI building.
In the age of data, we are lucky to have an extensive choice of tools for processing, analyzing and displaying it. Thanks to them, we can constantly improve not only our services but also our knowledge about the world.
Lead Software Engineer