
The Unasked Question
What was the question they didn’t ask?
Some years ago I was teaching a computer science class. I posed the class a problem.
You have an input stream of unsorted data that comes in blocks of 1MB. They are arriving at an average rate of 1 block every 10 seconds. You have to sort them and pass the results on to the next stage. How do you do this?
The class responded with a discussion about how long it took to do a sort. We know that a quicksort is O(NLogN) operations (where an operation is a comparison and possibly a interchange of item positions). So a million items will take in the order of 20,000,000 operations. For simplicity, assuming that an operation takes a nanosecond, then it will take about 20 milliseconds to sort the 1 million items.
That seemed doable. But then I said, what if the million goes to a billion. What then. Well that would be 30 billion comparisons which at 1 nanosecond each gives 30 seconds. Clearly this doesn’t keep up with one block of data per 10 seconds.
So now the class plunged into a lot of very complicated strategies. They could get a processor with many cores. Then they could have some kind of switching algorithm to assign each task to a different core.
Then I said what happens if we go to 10 billion items. With a calculated sort time of around 15 minutes each, it became unclear if there was a solution outside of specialized super-computer hardware.
But what was the question they weren’t asking?
They never asked about the data. How was it being generated? What was it for?
When you are doing enterprise software, or any software for that matter, you always have to understand the data. Where is it coming from? Under what conditions is it being prepared? What are its characteristics?
The largest gains in efficiency come from exploiting characteristics of the data.
If the students had asked, I would have told them that the data was coming from two machines. The sorting order was the time stamps which were stored as a long integer (64 bits) in the data. These machine were producing data samples and sending them to another computer that constructed the block of one billion samples and sent it on. The timestamps were almost always in order but once in a while there was a timing issue where the two machines put a sample in at the same time and sometimes the one with the lower timestamp got inserted after the higher one.
So in this case our block of a billion samples is mainly sorted with occasional transpositions.
This is a whole different case and can be solved with a lot simpler approach.
public class Sorter
{
// construct a sample with a -1 time
private Sample lastValue = new Sample(-1L); public sort(Sample sample)
{
if (sample.time > lastValue.time)
{
send(lastValue);
lastValue = sample;
}
else
{
send(sample);
} }
All it takes is some code to check the order and fix it in the few cases when the samples are out of order. Of course all this depends on how reliable our assumption of the data characteristics are. If there is even the slightest chance that a sample could be more than one removed from its correct order then this code would have to be changed to accommodate that. In fact, in real code, you would definitely put in a check for this and throw an exception.
Because the out-of-order situations are rare, that means that pretty much every sample does one comparison. So that means we only have a billion comparisons which can be done in the order of a second. We can do ten billion in ten seconds without the use of a super computer.
All of the conventional wisdom says don’t write your own sort. In most cases that is exactly right. If you just have to sort some random data then a built-in sort in your particular class library is definitely the way to go.
However, we can see that in this case that writing our own sort is absolutely the way to go. By using the special knowledge we have of the data we can improve the sort from o(NlogN) to o(N) and take the problem from something that would require an elaborate hardware/software combination solution and change it to a straightforward algorithm.
One just can’t throw blanket commandments around like: never write your own sort. Or even: never write your own database.
You’ve got to ask the right questions and do what is best for the situation you have.
✉️ Subscribe to CodeBurst’s once-weekly Email Blast, 🐦 Follow CodeBurst on Twitter, view 🗺️ The 2018 Web Developer Roadmap, and 🕸️ Learn Full Stack Web Development.