Students, even those with minimal computer science training, don’t often have issues with the concept of a data type when learning Python. A string is put inside a string, and an integer holds an integer; so far, so good. But the primitive data types pose few issues for understanding because students are relying on intuitions from the physical world, intuitions that may not be as reliable when we increase the complexity of our data structures or the work we’re asking those data structures to do.
Alright, I’ll be honest: an understanding of abstract data types (ADT) is not essential for a beginning data scientist. But it’s unquestionably a fun topic (do not question me on this!). Additionally, the point of the code we write is to store and manipulate data, so it’s important to know how to handle this data efficiently. As we progress in our data science journey, knowing how data storage is handled will help us write more efficient code.
We have to start by establishing definitions of higher and lower level concepts before circling back to the concept of abstract data types near the end, so hang on. Note that the following definitions are most relevant for an object-oriented paradigm and I’m concerned only with high-level programming languages.
Primitive Data Types
There’s a lot of jargon in computer science, so let’s nail down a few definitions. Primitive data types are, as their name suggests, the building blocks of information storage in computer programs. They are the lowest-level representation of information in high-level languages (say that five times fast). That is, they’re not the zeroes and ones of information that the computer understands, but they are the lowest level of abstraction you’ll deal with when writing in high-level languages like Python or Java.
In Python, for example, the primitive data types are integers, floats, booleans and strings. A data type is described by the (possibly infinite) collection of information that the type can hold. For example, an integer type is defined as all the integers (0, -1, 1, -2, 2, -3, 3, …). A data type also includes the operations that can be carried out on that type. For example, integers can be added or subtracted. The collection of integers (0, -1, 1, …), along with the operations (addition, subtraction, etc.) make up the data type. This is a mathematical definition.
Non-Primitive Data Types
Some data types are not themselves primitive but are containers for primitive data types. For example, arrays are collections of a primitive data type. Non-primitives that are collections of primitives are called composite types. Not all non-primitives are composites, though.
Non-primitive types come in a variety of flavors. Some can hold only homogenous collections (e.g. arrays in C++), and some can hold heterogenous collections (e.g. lists or tuples in Python). Other structures include stacks, queues, trees, heaps, hash tables and many others. A discussion of all possible data structures is beyond the scope of this post, however. What we’re interested in here is what these data structures have in common, not what separates them in terms of behavior.
It’s important to repeat that a data type is a mathematical definition, and is a distinct concept from how a type is implemented in software. Defining a data type requires only the logical definition (e.g. an integer with the division operation defined on it), but does not have anything to say about the implementation in your favorite programming language (e.g. for an integer, whether it’s short, long, signed, or unsigned). This distinction may appear to be overkill when talking about something like an integer. But when we start talking about more complex data types like trees or arrays, the difference matters (more on this in a bit).
Data Types or Data Structures?
I’ll admit to using the terms data type and data structure interchangeably, even when there’s a technical difference. Am I particularly careful about using the correct one? Not even half of the time. And in common parlance, it usually doesn’t matter. But we’re here to learn, so let’s dive into the difference.
A data type is defined by the collection of information that it can hold (e.g. the infinite set of all possible strings) combined with the operations that can be performed on the type (e.g. concatenating strings).
A data structure, however, is the data type incarnate, or the software implementation of an abstract data type (ADT). Whoa, I thought we were talking about data types, where did this abstract data type come from? It’s true, we haven’t defined abstract data types yet, so let’s take a detour to do that, then get back to data structures.
Abstract Data Types (Finally!)
An abstract data type is the logical definition of a data type in software. An ADT defines the type of information a type will hold and the methods that can be applied to it.
One example of an ADT is a deque, or double-ended queue. A deque is a generalization of a queue, but where items can be added and removed both to/from the front and back.
The ADT definition of a deque might include the following:
- An ordered collection of entities
- New entities can be inserted on either end of the collection
- Only entities on the ends can be examined for their values
- Entities can be deleted from either end of the collection, but only the ends can be deleted
Nothing from the ADT definition of a deque tells us how to implement it in software. A user of a deque would be expecting methods to add, remove and return the values of the ends, and that’s what the ADT promises the user. But the user doesn’t care about whether the deque is a linked list or array under the hood.
The ADT definition of a deque, combined with a given implementation (say, linked list), together make up a data type. The ADT can be thought of as the promise that the type makes to the user: I’ll let you add entities in this way, delete them in this way, and traverse the collection in this way. But the promise is separate from the implementation, which is the data structure.
The ADT and the data structure are both in service of the data type.
Why Multiple Implementations?
Imagine an implementation of a database. All databases by definition support insertion, deletion and search. If we’re collecting streaming data and need to write to our database a lot, we’d prefer an implementation that is really efficient at insertion. But if we have a fairly static database that our users are querying a lot, we might prefer an implementation that’s slow to insert but fast at queries.
Not all programs that use the same data type will use it in the same way. There might be some methods on a type that one program relies on heavily that a second program jettisons entirely. For this reason, we don’t want to be tied to a single implementation when using a given data type.
Our user doesn’t care about the implementation of the type, as long as it does what it’s supposed to do. And we shouldn’t burden the user by the implementation details (a concept called encapsulation).
We’ve learned three major terms.
- Data type: a mathematical definition of information, such as the type of information (integer, string, etc.) and the operations that are defined on it
- Abstract data type (ADT): the interface of a data type in software, such as methods to pop from and push to a stack
- Data structure: A given implementation of an ADT, such as using doubly-linked lists to implement a deque
Data scientist at Metis. Previously a risk consultant at McKinsey & Co. I have a master's in public policy and econometrics from the University of Chicago and a B.A. in economics and political science from Northwestern University. I'm currently working towards a master's in computer science from DePaul University in Chicago.