
The importance of using the right data structure

November 14, 2013

Have you ever read The Algorithm Design Manual? No? Well, I'm not saying you have to, but it might be useful and it certainly wouldn't hurt. The name gives you a hint about its content, but for me it's the way that content is explained that makes this book great.

I think the way the author presents his concepts, giving real-life examples, makes all the difference. And of course, the famous Catalog of Algorithmic Problems is an invaluable resource.

On top of all that, the author went to the trouble of writing an entire chapter about data structures. If you go through that chapter, you soon realize how important the topic is to him. More: he even states that, sometimes, you might not need to change or improve your algorithm at all. You probably just need to use the right data structure for the task at hand.

We could go on and argue about which data structure is best for each task (read the book!), but instead I will demonstrate the importance of using the right one for the job with a practical example. For that we will use Problem 23 from Project Euler, which reads like this:

Non-abundant sums

Problem 23

A perfect number is a number for which the sum of its proper divisors is exactly equal to the number. For example, the sum of the proper divisors of 28 would be 1 + 2 + 4 + 7 + 14 = 28, which means that 28 is a perfect number.

A number n is called deficient if the sum of its proper divisors is less than n and it is called abundant if this sum exceeds n.

As 12 is the smallest abundant number, 1 + 2 + 3 + 4 + 6 = 16, the smallest number that can be written as the sum of two abundant numbers is 24. By mathematical analysis, it can be shown that all integers greater than 28123 can be written as the sum of two abundant numbers. However, this upper limit cannot be reduced any further by analysis even though it is known that the greatest number that cannot be expressed as the sum of two abundant numbers is less than this limit.

Find the sum of all the positive integers which cannot be written as the sum of two abundant numbers.
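The definitions above are easy to check with a few lines of Python. This is just a quick illustration of perfect versus abundant numbers, not part of the linked solution:

```python
def proper_divisors(n):
    """All divisors of n except n itself."""
    return [d for d in range(1, n) if n % d == 0]

# 28 is perfect: 1 + 2 + 4 + 7 + 14 = 28
print(sum(proper_divisors(28)))  # 28
# 12 is abundant: 1 + 2 + 3 + 4 + 6 = 16, which exceeds 12
print(sum(proper_divisors(12)))  # 16
```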

Check this for a possible solution. So, what do we have here? A few things:

  • a function to get the proper divisors of a number;
  • a main loop that goes up to the limit, checks whether each number is abundant and, if the number cannot be written as the sum of two abundant numbers, adds it to the sum;
  • s stores the sum;
  • abundant_numbers is a list that stores the abundant numbers.
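The linked code isn't reproduced here, but given the description above it probably looks something like this sketch (function and variable names beyond `s` and `abundant_numbers` are my guesses):

```python
def proper_divisors(n):
    """Proper divisors of n, found by trial division up to sqrt(n)."""
    divisors = [1]
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            divisors.append(d)
            if d != n // d:
                divisors.append(n // d)
    return divisors

def non_abundant_sum(limit=28123):
    s = 0                  # running sum of the numbers we want
    abundant_numbers = []  # every abundant number seen so far
    for i in range(1, limit + 1):
        if sum(proper_divisors(i)) > i:
            abundant_numbers.append(i)
        # i is kept only if no abundant pair a + (i - a) exists
        if not any((i - a) in abundant_numbers for a in abundant_numbers):
            s += i
    return s
```

With the full limit of 28123 this produces the expected answer, but note where the work goes: the `in abundant_numbers` test inside `any` runs once per abundant number, for every `i`.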

Let's just run this and check how long it takes:

$ time python problem23.py
4179871

real 3m20.546s
user 3m20.437s
sys 0m0.008s

Right answer! But 3m20s? I can't live with that. I start to get itchy and have to find a way to reduce that time. The first thing that comes to mind is that our approach might not be the most appropriate. Maybe there's a better way to find a number's proper divisors. Maybe there's a more efficient way to find out if a number is the sum of two abundant numbers. But let's take a step back and check our data structures first.

The only one we're using is a list to store our abundant numbers. So, let's take a look at this line of code:

if not any((i-a in abundant_numbers) for a in abundant_numbers):

You can imagine that, as the list grows in size, so does the time it takes to do each check. Since it is a plain list, poor Python has no choice but to iterate through all its items and see whether any matches what we want. I'm starting to think that if we had a data structure with a more efficient membership test, that huge time would decrease drastically.
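We can measure that difference directly. Here is a quick (and admittedly unscientific) micro-benchmark comparing membership tests on a list and a set holding the same elements:

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)

# A list membership test scans the items one by one: O(n) worst case.
list_time = timeit.timeit(lambda: (n - 1) in as_list, number=100)
# A set membership test is a hash lookup: O(1) on average.
set_time = timeit.timeit(lambda: (n - 1) in as_set, number=100)

print(f"list: {list_time:.4f}s, set: {set_time:.4f}s")
```

On any reasonable machine the set lookup wins by orders of magnitude, and that is exactly the gap we are about to exploit.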

What about a set? Sets are nothing more than unordered collections of unique elements. And good old Python even has a built-in implementation, which behind the scenes uses the same hashing machinery as dictionaries. The code changes needed to use a set are minimal, check here. Let's run it and find out how much we improved:

$ time python sets_problem23.py
4179871

real 0m12.476s
user 0m12.461s
sys 0m0.008s

Wow! We only changed two lines of code (not even entire lines!) and we went from 3m20s to roughly 12.5s. Now that's a good improvement!
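Since the changed code sits behind a link, here is a sketch of what the set-based version amounts to, with the two changed spots commented (names are, again, my guesses):

```python
def proper_divisors(n):
    """Proper divisors of n, found by trial division up to sqrt(n)."""
    divisors = [1]
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            divisors.append(d)
            if d != n // d:
                divisors.append(n // d)
    return divisors

def non_abundant_sum(limit=28123):
    s = 0
    abundant_numbers = set()         # changed: was a list ([])
    for i in range(1, limit + 1):
        if sum(proper_divisors(i)) > i:
            abundant_numbers.add(i)  # changed: was .append(i)
        if not any((i - a) in abundant_numbers for a in abundant_numbers):
            s += i
    return s
```

The algorithm is untouched; only the membership test inside `any` got cheaper, from a linear scan to a hash lookup.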

The lesson here? Using the right data structure makes all the difference. And it's not only me saying that: Mr. Skiena says it too (read the book!).

Ricardo Castro

Software Engineering, DevOps, SRE, Taekwondo and Metal