Dear friends!
Every exploratory data analysis (EDA) has to start somewhere, and that is usually with understanding the datatypes. In most cases, your tool will tell you when a datatype does not work (what do you mean I can’t create a correlation table based on text data?). But often it will not, and you will need to classify the datatype correctly yourself (or with the help of GenAI) before using it. In this article, I will discuss fundamental datatypes for analysts, how they impact your analysis, and ways to leverage them effectively.
Why Data Types Are Important
Data types are the attributes of data that determine how a computer system or analytics software will interpret its values. They also serve as a means of categorizing and organizing data, which makes the data easier for the analyst to understand and manipulate, and they specify which mathematical operations can be applied. There are two main types of data in statistics: numerical (quantitative) and categorical (qualitative).
Quantitative data refers to data that can be expressed as a number and can be measured by numerical variables. It answers questions such as “how many” or “how much.” Examples of quantitative data in the context of cars include the number of miles driven, the engine size in liters, or the fuel efficiency measured in miles per gallon. There are two subtypes of quantitative data: discrete data (data that only involves integers and cannot be divided into parts) and continuous data (data that can take any value within a certain range). The data type used to store this kind of data can be either integer or float. For example, if we are measuring the engine size of a car, it could be stored as a float data type, as engine size is usually expressed with decimals. On the other hand, if we are counting the number of cars manufactured, it would be stored as an integer data type.
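As a minimal sketch of the integer versus float distinction, here is how it might look in pandas. The column names and values are made up for illustration; they are not from a real dataset.

```python
import pandas as pd

# Hypothetical car data, invented for this example.
cars = pd.DataFrame({
    "units_built": [1200, 850, 430],     # discrete: counts of whole cars
    "engine_liters": [1.6, 2.0, 3.5],    # continuous: a measured quantity
})

# pandas infers an integer dtype for the counts and a float dtype
# for the measurements.
print(cars.dtypes)

# Helper checks from pandas.api.types confirm the inferred kinds.
is_count_integer = pd.api.types.is_integer_dtype(cars["units_built"])
is_size_float = pd.api.types.is_float_dtype(cars["engine_liters"])
print(is_count_integer, is_size_float)
```

Checking `dtypes` right after loading a dataset is a quick way to spot columns that were read in as something other than what you expected.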
Qualitative data refers to data that describes qualities or categories rather than measured quantities. It consists of words, pictures, and symbols and answers questions about what kind of thing something is, rather than how much of it there is. Examples of qualitative data include colors, names, or brands. There are two subtypes of qualitative data: nominal data (data used only for labeling variables, with no quantitative value and no inherent order) and ordinal data (data whose categories have a meaningful order or rank). Suppose a survey was conducted to gather data on the preferred brand of cell phone among individuals. The possible responses were Apple, Samsung, Huawei, Xiaomi, and Other. The responses are considered nominal data because they are categorical and do not have any inherent order or numerical value.
In another survey, individuals were asked to rate their satisfaction with their current job on a scale of 1–5, with 1 being “very dissatisfied” and 5 being “very satisfied”. The responses are considered ordinal data because they have a defined order and can be ranked, but the intervals between the ratings are not necessarily equal.
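One way to make the nominal versus ordinal distinction explicit in code is the pandas categorical type. This is a sketch under assumptions: the survey responses below are invented, and the ratings are stored as the numbers 1 to 5 with an explicit order.

```python
import pandas as pd

# Nominal: brand labels with no inherent order (made-up responses).
brands = pd.Series(["Apple", "Samsung", "Other", "Apple"], dtype="category")

# Ordinal: satisfaction ratings on the 1-5 scale, declared as ordered.
satisfaction = pd.Series(
    pd.Categorical([4, 2, 5, 3], categories=[1, 2, 3, 4, 5], ordered=True)
)

print(brands.cat.ordered)        # False: brands cannot be ranked
print(satisfaction.cat.ordered)  # True: ratings can be ranked

# Order-based operations are meaningful for ordinal data.
highest_rating = satisfaction.max()
satisfied_count = int((satisfaction >= 4).sum())
print(highest_rating, satisfied_count)
```

Declaring the ratings as an ordered categorical keeps comparisons such as "at least 4" available without inviting arithmetic like averaging, which assumes the intervals between ratings are equal.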
Data Type Conversion
Data type conversion often starts with encoding, the process of transforming data into a different format using specific algorithms. Consider a dataset containing information about the make and model of cars owned by individuals. This categorical data can be transformed into numerical data through encoding. For instance, each make and model can be assigned a numerical value, such as 1 for Toyota, 2 for Honda, and so forth. Keep in mind that these numbers are only labels: they suggest an order (Honda "greater than" Toyota) that does not exist in the data. Techniques like one-hot encoding, where each category is represented as a binary vector, avoid this problem and are also common.
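The two encoding approaches described above can be sketched in pandas as follows. The car makes and the code assigned to Ford are illustrative assumptions; only the Toyota and Honda codes come from the example in the text.

```python
import pandas as pd

# Made-up sample of car makes.
makes = pd.Series(["Toyota", "Honda", "Toyota", "Ford"], name="make")

# Integer (label) encoding with an explicit mapping.
# The value for Ford is an arbitrary choice for this sketch.
make_codes = {"Toyota": 1, "Honda": 2, "Ford": 3}
label_encoded = makes.map(make_codes)
print(label_encoded.tolist())

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(makes, prefix="make", dtype=int)
print(one_hot)
```

In the one-hot result, each row has exactly one column set to 1, so no make is treated as larger or smaller than another.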
Another important aspect of data type conversion is binning, the process of transforming numerical data into categorical data by grouping numbers into categories. This technique simplifies analysis and can make models more interpretable. For instance, numerical data such as horsepower in a car dataset can be transformed into categorical data through binning. Horsepower values can be grouped into categories like low (below 100), medium (100 to just under 200), and high (200 and above). Note that the boundaries must not overlap, so that every value falls into exactly one category.
Binning reduces the complexity of numerical data, which can make analyses and visualizations easier to read and compare, at the cost of some detail within each bin.
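A minimal binning sketch with `pandas.cut`, using the horsepower categories from the example. The horsepower values are invented, and treating 100 as medium and 200 as high (left-closed bins) is an assumption about how the boundaries should be handled.

```python
import pandas as pd

# Made-up horsepower values, including the boundary cases 100 and 200.
horsepower = pd.Series([75, 100, 150, 200, 320])

# Bin edges: [0, 100), [100, 200), [200, infinity).
# right=False makes each bin include its left edge and exclude its right edge.
bin_edges = [0, 100, 200, float("inf")]
bin_labels = ["low", "medium", "high"]

hp_band = pd.cut(horsepower, bins=bin_edges, labels=bin_labels, right=False)
print(hp_band.tolist())
```

The result is an ordered categorical, so the binned column behaves like ordinal data: low comes before medium, which comes before high.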
Understanding the different types of data and their subtypes, as well as the ability to correctly identify and convert data types, is essential for conducting accurate and effective data analysis. This is why data types are important in exploratory data analysis and machine learning projects. Data analysts should understand the differences between nominal, ordinal, discrete, and continuous data to select the correct data type for their analysis and avoid mistakes.