Why you shouldn't use delayed_job_active_record with PostgreSQL

Introduction

delayed_job is a background job worker system for Ruby. delayed_job_active_record is an adapter that lets delayed_job store its jobs in an RDBMS through ActiveRecord. PostgreSQL is one of the RDBMSs supported by delayed_job_active_record.

The architecture of delayed_job is out of scope for this article, and I don't compare it with other similar systems.

Problem 1. A Worker Takes All Jobs

delayed_job_active_record lets you choose between two SQL strategies. The default strategy is :optimized_sql. It looks efficient, but it causes a problem.
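For reference, the strategy is selected through the gem's configuration. A minimal initializer sketch, assuming the reserve_sql_strategy option documented in the delayed_job_active_record README:

```ruby
# config/initializers/delayed_job.rb
# Pick the reserve strategy explicitly; :optimized_sql is the default.
Delayed::Backend::ActiveRecord.configuration.reserve_sql_strategy = :default_sql
```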

A worker sometimes takes all the jobs in the queue at once. Normally, a worker takes one job at a time. If one worker can grab many jobs at once, job execution is delayed.

Cause

The :optimized_sql strategy builds the following SQL.

UPDATE "delayed_jobs"
   SET locked_at = '{Current Time}'
     , locked_by = '{This Worker Name}'
 WHERE id IN
   (
     SELECT "delayed_jobs"."id"
       FROM "delayed_jobs"
      WHERE
        (
          (
            run_at <= '{Current Time}' AND
            (
              locked_at IS NULL OR
              locked_at < '{Max Run Time Ago}'
            ) OR
            locked_by = '{This Worker Name}'
          ) AND
          failed_at IS NULL
        )
      ORDER BY priority ASC, run_at ASC
      LIMIT 1
        FOR UPDATE
   )
RETURNING *

The subquery should return a single row because of LIMIT 1. But it sometimes returns multiple rows. Why?

"The planner may choose to generate a plan that executes a nested loop over the LIMITing subquery, causing more UPDATEs than LIMIT." (from "Postgres UPDATE ... LIMIT 1" on Database Administrators Stack Exchange)

This is a behavior of PostgreSQL itself.
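One way to see whether the planner chose such a plan is to run EXPLAIN on the locking statement. A sketch; the worker name is illustrative, and the plan shape depends on table statistics and PostgreSQL version:

```sql
EXPLAIN
UPDATE "delayed_jobs"
   SET locked_at = now()
     , locked_by = 'host:worker.1'  -- hypothetical worker name
 WHERE id IN
   (
     SELECT "delayed_jobs"."id"
       FROM "delayed_jobs"
      WHERE run_at <= now() AND failed_at IS NULL
      ORDER BY priority ASC, run_at ASC
      LIMIT 1
        FOR UPDATE
   );
-- If the plan re-executes the LIMITing subquery per outer row (e.g. as part
-- of a nested loop), more than one row can be updated despite LIMIT 1.
```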

So, you shouldn't use :optimized_sql if you use delayed_job_active_record with PostgreSQL.

Problem 2. Workers Take the Same Job

You can also choose the old-fashioned :default_sql strategy.

"It is usually slower but works better for certain people." (from the collectiveidea/delayed_job_active_record README)

That is not true with PostgreSQL. Multiple workers sometimes take the same job even while it is locked, which means duplicate execution. Normally, a locked job is not executed by another worker.

Cause

The :default_sql strategy builds the following SQL.

UPDATE "delayed_jobs"
   SET "locked_at" = '{Current Time}'
     , "locked_by" = '{This Worker Name}'
 WHERE "delayed_jobs"."id" IN
   (
     SELECT "delayed_jobs"."id"
       FROM "delayed_jobs"
      WHERE
        (
          (
            run_at <= '{Current Time}' AND
            (
              locked_at IS NULL OR
              locked_at < '{Max Run Time Ago}'
            ) OR
            locked_by = '{This Worker Name}'
          ) AND failed_at IS NULL
        ) AND "delayed_jobs"."id" = {Job ID}
      ORDER BY priority ASC, run_at ASC
   )

This SQL does not take a row lock in the subquery. So it allows concurrent sessions to update the same record, even when another session has already set the locked_at field.
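To illustrate, here is one plausible interleaving between two workers, A and B. This is a hypothetical timeline; the job id 42 is invented for illustration:

```sql
-- Worker A: SELECT ... finds job 42 unlocked
-- Worker B: SELECT ... also finds job 42 unlocked (A has not updated yet)
-- Worker A: UPDATE ... WHERE id IN (SELECT ... AND id = 42)
--           -- sets locked_at and locked_by = 'A'
-- Worker B: UPDATE ... WHERE id IN (SELECT ... AND id = 42)
--           -- blocks on A's row lock; after A's statement commits, the
--           -- already-evaluated IN list (42) still matches, so B
--           -- overwrites locked_by = 'B'
-- Result: both A and B believe they reserved job 42, and both execute it.
```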

So, you shouldn't use :default_sql if you use delayed_job_active_record with PostgreSQL.

Solution?

If you want to keep using delayed_job_active_record with PostgreSQL, I can offer two solutions.

  • Choose :optimized_sql (if you can accept the delay)
  • Write the SQL yourself (if you are an SQL expert)

I think you may be able to resolve the problem by rewriting the SQL with a WITH clause.

The following SQL is a sample, but I don't guarantee its correctness. It is only a "SAMPLE".

WITH cte AS (
  SELECT "delayed_jobs"."id"
    FROM "delayed_jobs"
   WHERE
     (
       (
         run_at <= '{Current Time}' AND
         (
           locked_at IS NULL OR
           locked_at < '{Max Run Time Ago}'
         ) OR
         locked_by = '{This Worker Name}'
       ) AND
       failed_at IS NULL
     )
   ORDER BY priority ASC, run_at ASC
   LIMIT 1
     FOR UPDATE
)
UPDATE "delayed_jobs" AS t
   SET
     locked_at = '{Current Time}'
   , locked_by = '{This Worker Name}'
  FROM cte
 WHERE t.id = cte.id
RETURNING *
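If your PostgreSQL is 9.5 or later, another idea worth noting: SELECT ... FOR UPDATE supports a SKIP LOCKED clause, which makes concurrent workers skip rows that are already locked instead of waiting on them. A variant of the sample above using it (again, only a sketch, not guaranteed correct):

```sql
WITH cte AS (
  SELECT "delayed_jobs"."id"
    FROM "delayed_jobs"
   WHERE
     (
       (
         run_at <= '{Current Time}' AND
         (
           locked_at IS NULL OR
           locked_at < '{Max Run Time Ago}'
         ) OR
         locked_by = '{This Worker Name}'
       ) AND
       failed_at IS NULL
     )
   ORDER BY priority ASC, run_at ASC
   LIMIT 1
     FOR UPDATE SKIP LOCKED  -- PostgreSQL 9.5+: skip rows other workers hold
)
UPDATE "delayed_jobs" AS t
   SET locked_at = '{Current Time}'
     , locked_by = '{This Worker Name}'
  FROM cte
 WHERE t.id = cte.id
RETURNING *
```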

Appendix

Why don't I send a Pull Request?

delayed_job_active_record has had PRs sitting open since 2013; it seems to be an inactive project. This bug was reported to delayed_job a year ago, but only a few people reacted. Moreover, I don't think delayed_job is popular software anymore.

So I'm not motivated to send a PR to resolve this problem…