BTreeHugger's Beat

Using SQL arrays for efficiency

Here is the situation. You have a database of items. Each item has a frequency count. For each item you want to count the number which have the same frequency, lower frequency, and higher frequency. The naive way of doing this is something like this (in pseudocode):

for each ITEM in select * from item_table:
  update items set  equal_freq_count = (select count(id) from item_table where frequency = ITEM.frequency),
                      lower_freq_count = (select count(id) from item_table where frequency < ITEM.frequency),
                     higher_freq_count = (select count(id) from item_table where frequency > ITEM.frequency)
               where                id = ITEM.id;

Anyone with any kind of database (or algorithms) experience knows this is deadly code and will take all day for a modestly-sized database. So here is the next logical evolution of the code, which makes much better sense algorithmically:

running_lower  := 0;
running_higher := count( id ) from item_table;

for each FREQUENCY_CLASS in select count(id) as count,frequency from item_table group by frequency order by frequency:

  running_higher := running_higher - FREQUENCY_CLASS.count;

  update item_table set num_lower_frequency = running_lower,
                           num_equal_frequency = FREQUENCY_CLASS.count,
                          num_higher_frequency = running_higher
                where                frequency = FREQUENCY_CLASS.frequency;

  running_lower := running_lower + FREQUENCY_CLASS.count;

This takes our running time down from hours to minutes; for a database of 8000 items it was about 30m on my ancient development machine. However, slightly more experienced DBAs should tell you that running one update per frequency class is still pretty horrible.

The key lessons here are to use your RAM and to avoid as many updates as possible. Even if it seems like you will be gobbling up a lot of memory this approach is likely to be much better (especially since this is the type of operation that is unlikely to be happening on a regular basis). In this case you will be reducing the number of updates to just one:

lower_array  := Integer[];
equal_array  := Integer[];
higher_array := Integer[];

running_lower  := 0
running_higher := count( id ) from item_table;

for each FREQUENCY_CLASS in select count(id) as count,frequency from item_table group by frequency order by frequency:
  running_higher := running_higher - FREQUENCY_CLASS.count;

   lower_array[ FREQUENCY_CLASS.frequency ] = running_lower;
   equal_array[ FREQUENCY_CLASS.frequency ] = FREQUENCY_CLASS.count;
  higher_array[ FREQUENCY_CLASS.frequency ] = running_higher;

  running_lower := running_lower + FREQUENCY_CLASS.count;

update itemtable set num_lower_frequency = lower_array[ frequency ], num_equal_frequency = equal_array[ frequency ], num_higher_frequency = higher_array[ frequency ];

This will be almost instant compared to the bombardment of updates from the other solutions. Here is where SQL really flexes its muscle. Of course you may be concerned about momentarily sacrificing a lot of RAM, but you can always divide-and-conquer to keep things sensible.

The real victory here is that the query itself takes all of 5 seconds to perform on an 8000-item database. And that's on my ancient system, which isn't even remotely close to the little leagues of production database hardware. With that kind of performance, you might even be able to get away without storing the results at all. The lesson here is that you have to find the balance between CPU, RAM and Hard Drive resources. Learn these lessons well, my fellow naive database-goers.

In case anyone wants a less pseudocodian implementation, here is the above algorithm in quick PL/pgSQL (PostgreSQL):

CREATE OR REPLACE FUNCTION compute_frequency_partitions() RETURNS VOID AS $$
DECLARE

  num_rows RECORD;
  frequency_class RECORD;

  lower_array  INTEGER[];
  equal_array  INTEGER[];
  higher_array INTEGER[];

  running_lower  INTEGER = 0;
  running_higher INTEGER;

BEGIN

  FOR num_rows IN SELECT COUNT( id ) AS count FROM item_table LOOP
    running_higher := num_rows.count;
  END LOOP;

  FOR frequency_class IN SELECT COUNT( id ) AS count, frequency FROM item_table GROUP BY frequency ORDER BY frequency LOOP

    running_higher := running_higher - frequency_class.count;
  
     lower_array[ frequency_class.frequency ] = running_lower;
     equal_array[ frequency_class.frequency ] = frequency_class.count;
    higher_array[ frequency_class.frequency ] = running_higher;
  
    running_lower := running_lower + frequency_class.count;

  END LOOP;

  UPDATE item_table SET num_lower_frequency = lower_array[ frequency ], num_equal_frequency = equal_array[ frequency ], num_higher_frequency = higher_array[ frequency ];

END;
$$LANGUAGE plpgsql;