#KB Data Visualization: Part I | by Prof. Frenzel

6 min read · Feb 2, 2023

Dear friends!

As you may have noticed, understanding and analyzing data can be overwhelming. With the help of visualization tools, however, you can explore complex data sets effectively and make sense of them. In this article, I walk you through the visualization concepts and chart types that belong in every comprehensive Exploratory Data Analysis, and explain how they help you gain insights into your data. To learn how to build a story around your data, see the next article, Storytelling with Data.

Exploratory Data Analysis (EDA)

EDA is a vital stage in the data analysis process. It uses statistical summaries and graphical depictions to uncover patterns, detect anomalies, test hypotheses, and verify assumptions in the data. This initial examination improves our understanding of the data structure and of the relationships between variables, and it helps spot potential issues or inaccuracies. The process is useful for making data “clean” and free from redundancies, which can improve the accuracy of machine learning models, and it is also necessary to ensure that appropriate statistical procedures are used in the analysis.

There are two main approaches to EDA: non-graphical descriptive statistics and graphical methods (Exploratory Data Visualization, EDV). The two serve complementary purposes. Non-graphical approaches summarize the data set numerically, using statistical measures such as the mean, median, standard deviation, and skewness to describe the central tendency, variability, and shape of the distribution. Graphical approaches, on the other hand, represent the data set visually and help identify patterns, relationships, and anomalies. Graphical methods such as histograms, scatter plots, and box plots can reveal trends and deviations that might not be immediately apparent from the numerical summaries. By combining both approaches, data analysts can gain a comprehensive understanding of the data set and identify areas for further investigation.
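The non-graphical side of EDA can be sketched in a few lines of Python. The example below is a minimal illustration that uses only the standard library; the `describe` helper and the rent figures are hypothetical and do not come from any data set discussed in this article. Skewness is computed here as the mean of cubed z-scores (the simple moment-based definition), so statistical packages that apply a bias correction may report slightly different values.

```python
import statistics


def describe(values):
    """Return basic numerical summaries for a list of numbers.

    Assumes at least two values that are not all identical.
    """
    n = len(values)
    mean = statistics.mean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values)    # sample standard deviation (n - 1)
    pstdev = statistics.pstdev(values)  # population standard deviation (n)
    # Moment-based skewness: average of the cubed z-scores.
    skewness = sum(((x - mean) / pstdev) ** 3 for x in values) / n
    return {"mean": mean, "median": median, "stdev": stdev, "skewness": skewness}


# Hypothetical monthly rents; the last value is deliberately extreme.
rents = [950, 1000, 1050, 1100, 1150, 1200, 3000]
summary = describe(rents)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this made-up sample the mean (1350) sits well above the median (1100) and the skewness is positive. Both hint at the single extreme value before any chart is drawn, which is exactly the kind of signal a histogram or box plot would then confirm.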

Data visualization: From Exploration to Storytelling

For most of us, our data journey starts with the Data Exploration stage and Visual Analytics. In this stage, we focus primarily on gaining a preliminary understanding of the data while improving the speed and capabilities of our analysis process. Interactive and scalable visual tools streamline the analysis, so that analysts can quickly process and understand large amounts of data and reach more accurate and confident decisions.

The next stage, Data Explanation and Data Storytelling, begins once economic or statistical hypotheses have been tested, analytics models have been built, and we are finally ready to share insights with our target audience. This stage relies on carefully designed visual aids, such as charts, graphs, and maps, that convey complex information in a clear and engaging way. The goal is to make the data accessible and understandable to a wider audience, such as business stakeholders, policymakers, or the general public, who may not have a background in data analysis. By presenting the insights in a visually appealing and intuitive way, Data Storytelling can help drive action and impact. In this way, we can unlock the full potential of data and use it to make better decisions and drive positive change.

Exploratory Data Visualization (EDV)

Data visualization can range from simple to complex, with univariate, bivariate, and multivariate as the levels of complexity. Essentially, the level of complexity is determined by the number of dimensions displayed in the visualization. The more dimensions a visualization displays, the more challenging it can be to understand, but also the more valuable the insights it can offer.

Univariate Visualization Tools

Univariate visualization is a technique for analyzing one variable at a time to understand its characteristics and patterns. It uses descriptive techniques to answer questions about both numerical and categorical variables. For categorical variables, bar plots give a quick representation of the frequency or proportion of values in each category. For numerical variables, box plots and histograms are commonly used; they summarize the minimum and maximum values, central location, and spread, and they reveal patterns such as skewness or multimodality. These charts highlight features, such as extreme values or a skewed shape, that could affect later analysis.
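The numbers behind these univariate charts can be computed directly, which makes it easier to see what each chart encodes. The sketch below is illustrative only: `category_proportions`, `five_number_summary`, and `plot_univariate` are hypothetical helper names, the sample data is made up, and the plotting function assumes the third-party matplotlib library is installed (it is imported only when that function is called). Quartiles use the "inclusive" method from Python's `statistics` module; other tools may use slightly different quartile conventions.

```python
from collections import Counter
import statistics


def category_proportions(labels):
    """Share of observations in each category (the heights of a bar plot)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def five_number_summary(values):
    """Minimum, quartiles, and maximum (the landmarks a box plot draws)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def plot_univariate(labels, values):
    """Draw a bar plot, a histogram, and a box plot side by side."""
    import matplotlib.pyplot as plt  # assumption: matplotlib is available

    fig, (ax_bar, ax_hist, ax_box) = plt.subplots(1, 3, figsize=(12, 3))
    counts = Counter(labels)
    ax_bar.bar(list(counts.keys()), list(counts.values()))
    ax_bar.set_title("Bar plot (categorical)")
    ax_hist.hist(values, bins=5)
    ax_hist.set_title("Histogram (numerical)")
    ax_box.boxplot(values)
    ax_box.set_title("Box plot (numerical)")
    return fig


# Hypothetical sample: suburb labels and apartment sizes.
suburbs = ["North", "North", "East", "West"]
sizes = [40, 50, 60, 70, 80, 90, 100, 110, 120]

shares = category_proportions(suburbs)
box_stats = five_number_summary(sizes)
print(shares)
print(box_stats)
```

Calling `plot_univariate(suburbs, sizes)` would render the three charts; the two summary functions run without any plotting library.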

Bivariate Visualization Tools

Bivariate analysis is a valuable tool for uncovering the connections between two variables in a data set. Choosing a visual representation that matches the data types involved is key to interpreting the results. For instance, when examining the relationship between a quantitative and a categorical variable, side-by-side box plots or overlapping histograms can provide a clear picture. If the focus is on the relationship between two quantitative variables, a scatter plot is usually the best choice because it can reveal both linear and non-linear (for example, exponential) relationships. When exploring the relationship between two categorical variables, side-by-side or stacked bar plots present the data effectively.
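The pairing rules above amount to a small lookup table. The following sketch encodes them in Python; the function name `suggest_bivariate_chart` and the type labels "quantitative" and "categorical" are illustrative choices, and the suggestions simply restate the guidance in this section rather than any library's behavior.

```python
# Chart suggestions keyed by the (alphabetically sorted) pair of variable types.
_SUGGESTIONS = {
    ("categorical", "categorical"): ["side-by-side bar plot", "stacked bar plot"],
    ("categorical", "quantitative"): ["side-by-side box plots",
                                      "overlapping histograms"],
    ("quantitative", "quantitative"): ["scatter plot"],
}


def suggest_bivariate_chart(type_x, type_y):
    """Return chart suggestions for a pair of variable types.

    The order of the two arguments does not matter.
    """
    key = tuple(sorted((type_x, type_y)))
    if key not in _SUGGESTIONS:
        raise ValueError(f"Unknown variable types: {type_x!r}, {type_y!r}")
    return _SUGGESTIONS[key]


print(suggest_bivariate_chart("quantitative", "categorical"))
print(suggest_bivariate_chart("quantitative", "quantitative"))
```

Sorting the pair before the lookup means that ("quantitative", "categorical") and ("categorical", "quantitative") resolve to the same suggestion, so each rule is stored only once.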

Multivariate Visualization Tools

Multivariate data contains three or more variables, and the main purpose of analyzing such data is to study the relationships among these variables. Popular multivariate visualization tools include Heat Maps, Matrix Plots, and Parallel Coordinate Plots. These tools can help you find complex relationships between variables of the same or different types. A simple way to start is to add a third variable to a scatter plot, which makes it possible to study three variables simultaneously. For example, a scatter plot of rental prices versus square footage, with each point colored by the suburb in which the rental is located, could show a positive linear relationship between price and area as well as the differences in rental price and size between the suburbs. Another multivariate tool is a heat map of a correlation matrix, which shows the strength of correlation among variables. Run charts and bubble charts are other common choices for exploring relationships between multiple variables.
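A correlation heat map is just a colored grid of pairwise correlation coefficients, so the underlying matrix can be built first and inspected on its own. The sketch below computes Pearson correlations in plain Python for a hypothetical rental data set (the column names and values are invented for illustration). The optional `plot_heat_map` helper assumes the third-party matplotlib library is installed and is not needed to compute the matrix.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of numeric columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


def plot_heat_map(matrix):
    """Render the correlation matrix as a heat map."""
    import matplotlib.pyplot as plt  # assumption: matplotlib is available

    names = list(matrix)
    grid = [[matrix[a][b] for b in names] for a in names]
    fig, ax = plt.subplots()
    image = ax.imshow(grid, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(list(range(len(names))))
    ax.set_xticklabels(names)
    ax.set_yticks(list(range(len(names))))
    ax.set_yticklabels(names)
    fig.colorbar(image, ax=ax)
    return fig


# Hypothetical rental listings.
rentals = {
    "area_sqft": [400, 550, 700, 850, 1000],
    "price": [800, 1000, 1250, 1400, 1700],
    "distance_km": [2, 9, 4, 12, 6],
}
corr = correlation_matrix(rentals)
for name, row in corr.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

In this invented sample, price and area are strongly positively correlated, which is the pattern the colored scatter plot described above would also show.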

The four pillars of data visualization

The foundation of data visualization is built upon four pillars: distribution, relationship, comparison, and composition. The type of visualization to be used depends on the nature of the data and the information that needs to be communicated.

Distribution refers to the probability of occurrence of an outcome and is typically shown through frequency distributions such as histograms or density curves. Dispersion, one aspect of a distribution, describes how a variable is spread around its central tendency and can be shown through box plots.
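A histogram is a frequency distribution over equal-width bins, and the counting behind it is simple enough to write out. The sketch below is a plain-Python illustration with a hypothetical helper name, `histogram_counts`, and made-up values. It follows the common convention that every bin includes its left edge and that the last bin also includes the maximum value.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("All values are identical; bin width would be zero.")
    width = (hi - lo) / bins
    counts = [0] * bins
    for value in values:
        # The maximum value belongs to the last bin, not to a bin past it.
        index = min(int((value - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# Hypothetical observations.
observations = [1, 2, 2, 3, 3, 3, 4, 4, 5]
edges, counts = histogram_counts(observations, bins=4)
print(edges)
print(counts)
```

Dividing each count by the total number of observations turns the frequencies into relative frequencies, which is the sense in which a histogram approximates the probability of each outcome range.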

Relationship refers to the connection between two or more variables. A good visualization can help identify these relationships without the need for complex statistical analysis. For example, the relationship between the height of a tree and its age can be visualized with a scatter plot.

Comparison refers to comparing multiple variables in datasets or categories within a single variable. This can be shown through bar charts, line graphs, or other comparative visualization techniques. For example, a bar chart can compare the salary between two groups of observations, while a line graph can compare a variable between two groups along a time dimension.

Composition refers to showing how one or more variables break down into parts, either in absolute numbers or in normalized form (shares of a total). This can be shown through pie charts or stacked bar charts. Although pie charts are sometimes considered old-fashioned, they can still present information in a visually appealing and familiar manner.
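The difference between absolute numbers and the normalized form is a single division by the total. The short sketch below illustrates it with a hypothetical `to_shares` helper and an invented monthly budget; the resulting shares are what a pie chart or a 100% stacked bar chart would display.

```python
def to_shares(absolute):
    """Convert absolute amounts per category into shares of the total."""
    total = sum(absolute.values())
    if total == 0:
        raise ValueError("Total is zero; shares are undefined.")
    return {category: amount / total for category, amount in absolute.items()}


# Hypothetical monthly budget in absolute numbers.
budget = {"rent": 1200, "food": 600, "transport": 200}
budget_shares = to_shares(budget)
for category, share in budget_shares.items():
    print(f"{category}: {share:.0%}")
```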

Data visualization is both an art and a science, and selecting the right type of chart to use is a key consideration. By incorporating an in-depth Exploratory Data Visualization into the EDA process, one can gain deeper insights into the data and make better-informed decisions. The visual representation of data not only makes it easier to understand, but also helps to identify patterns, trends, and anomalies that would otherwise go unnoticed.