Skip to content

Query to get last update of a dataset can be quite slow #263

@Kjir

Description

@Kjir

The query to get the last update of a dataset can be quite slow, especially once the number of time series in the dataset grow above 90'000.

I wrote an alternative query that seems to perform slightly better, but there probably needs to be an additional index somewhere to make it really go fast.

The existing query is:

-- Get the last time the dataset was updated
--
-- Returns the created_at timestamp of the series in the given dataset
-- that was most recently updated.
CREATE OR REPLACE FUNCTION timeseries.dataset_get_last_update(p_dataset TEXT)
RETURNS TABLE(name TEXT, updated TIMESTAMPTZ)
AS $$
BEGIN
  CREATE TEMPORARY TABLE tmp_ts_read_keys
  ON COMMIT DROP
  AS (
    SELECT ts_key
    FROM timeseries.catalog AS cat
    WHERE set_id = p_dataset
  );

  RETURN QUERY
  SELECT p_dataset AS name, max(ud.updated) AS updated
  FROM timeseries.ts_get_last_update() AS ud;
END;
$$ LANGUAGE PLPGSQL
SECURITY DEFINER
SET search_path = timeseries, pg_temp;

-- Get the last time time series were updated
--
-- Returns the created_at of the given time series
CREATE OR REPLACE FUNCTION timeseries.ts_get_last_update()
RETURNS TABLE(ts_key TEXT, updated TIMESTAMPTZ)
AS $$
BEGIN
  RETURN QUERY
  SELECT
  DISTINCT ON(mn.ts_key)
  mn.ts_key, mn.created_at AS updated
  FROM timeseries.timeseries_main AS mn
  JOIN tmp_ts_read_keys AS rd
  USING(ts_key)
  ORDER BY mn.ts_key, mn.created_at DESC;
END;
$$ LANGUAGE PLPGSQL
SECURITY DEFINER
SET search_path = timeseries, pg_temp;

The query I wrote is the following:

SELECT MAX(updated) updated
  FROM (
    SELECT
       ts_key,
      (
         SELECT MAX(created_at) updated
         FROM timeseries.timeseries_main tm
         WHERE tm.ts_key = catalog.ts_key
       )
    FROM timeseries.catalog
    WHERE set_id = '<set_id>'
  ) AS s;

The timings with the ch.fso.cah.inv dataset, which contains 69'366 time series, is as follows:

image

The difference in time between the first query and the following ones is probably due to Postgres caching the table in memory after the first read, but the time when used in actual code would probably be closer to the first run.

The timings with the updated query are as follows:

image

Some data

Set N. series Time old Time new
ch.kof.ivua.ng08 120'742 First Run: ~270s , Cached: ~270s FR: ~49s, C: ~6s
ch.kof.ivus.ng08 97'595 FR: ~245-250s, C: ~245-250s FR: ~13s, C: ~4s
ch.fso.cah.inv 69'366 FR: ~4.5s, C: ~2.9s FR: ~2s, C: ~2s
ch.fso.hesta 4'212 FR: ~1.6s, C: ~1.6s FR: ~1.2s, C: ~1.2s
ch.fso.unemp 84 FR: ~1.2s, C: ~1.1s FR: ~1.2s, C: ~1.1s

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions