Misk Cron: Cluster-Safe Scheduled Jobs

Series: Building Production Services with Misk — Part 19 of 24

Eventually every service grows a job that has to run on a clock: expire stale sessions at 2 a.m., roll up yesterday’s metrics, sweep a dead-letter queue every five minutes. The naive instinct is to reach for a ScheduledExecutorService and a @PostConstruct. That works beautifully on your laptop and detonates in production, because your service isn’t one process. It’s a deployment with three, ten, fifty replicas, and every single one of them dutifully fires the same cron at the same instant. Now you’ve expired the same sessions three times and emailed the same customer ten times. Misk cron exists to solve exactly this: define a scheduled job once with cron syntax, and let the framework coordinate the cluster so that the job runs on one pod. Let’s wire it.

The cluster problem cron actually solves

A timer in a single JVM is trivial. The hard part of scheduled work in a distributed service is coordination — making N identical replicas agree that only one of them should do the thing. Misk cron’s entire reason for existing is that coordination layer, built on leases. You write a plain Runnable, annotate it with a cron expression, and the framework handles the “who runs it” question that would otherwise be a distributed-systems footgun.

Be honest with yourself about one thing up front, because the module’s own README is: this is not an at-least-once executor. From the docs verbatim: “tasks can be delayed or missed entirely for many reasons, including if the instance currently holding the lease is degraded or if it dies completely while executing the task.” If a job is business-critical and must not be skipped, misk cron is the wrong tool and you want Temporal or another durable workflow engine. Misk cron is for the large, useful middle: housekeeping, rollups, and sweeps you’d like to happen on schedule and can tolerate occasionally missing.

Defining a cron job

A cron job is just a class implementing java.lang.Runnable, annotated with @CronPattern:

@Target(AnnotationTarget.CLASS)
annotation class CronPattern(val pattern: String)

That’s the whole annotation — one string, applied to a class. The string is standard five-field Unix cron syntax: minute hour day month weekday. No seconds field, no Quartz-style sixth field; resolution is one minute, which we’ll come back to. Here’s a real one, adapted from the exemplar:

@Singleton
@CronPattern("0 2 * * *")   // every day at 02:00
class ExpireSessionsCron @Inject constructor(
  private val sessions: SessionStore,
) : Runnable {
  override fun run() {
    val purged = sessions.deleteExpired()
    logger.info { "Expired $purged sessions" }
  }

  companion object {
    private val logger = getLogger<ExpireSessionsCron>()
  }
}

Note what this gives you for free: it’s a normal injectable Singleton, so your job gets constructor injection like any other component — stores, clients, clocks, whatever. The run() body is ordinary blocking code. The @CronPattern annotation’s KDoc spells the syntax out, with examples worth keeping next to your keyboard:

"0 0 * * *" — daily at midnight
"*/15 * * * *" — every 15 minutes
"30 9 * * 1-5" — 9:30 AM on weekdays
"0 9-17 * * 1-5" — top of every hour, 9 AM–5 PM, weekdays

The pattern can also live on the registration instead of the class (more on that next), which is handy when the schedule is environment-specific and you don’t want it baked into the annotation.
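To make the five fields concrete, here is a minimal sketch of how a pattern like the ones listed above can be checked against a timestamp. It is an illustration only, not the parser misk uses: fieldMatches and cronMatches are hypothetical helpers, they cover just the `*`, `*/n`, `a`, `a-b`, and comma-list forms, and they simply AND all five fields together.

```kotlin
import java.time.LocalDateTime

// Illustrative matcher for a subset of five-field cron syntax.
// NOT misk's parser; the names and the supported forms are assumptions for this sketch.
fun fieldMatches(field: String, value: Int, min: Int = 0): Boolean =
  field.split(",").any { part ->
    when {
      part == "*" -> true
      // Steps count from the field's minimum (0 for minutes, 1 for day-of-month).
      part.startsWith("*/") -> (value - min) % part.removePrefix("*/").toInt() == 0
      "-" in part -> {
        val (lo, hi) = part.split("-").map { it.toInt() }
        value in lo..hi
      }
      else -> part.toInt() == value
    }
  }

fun cronMatches(pattern: String, t: LocalDateTime): Boolean {
  val fields = pattern.trim().split(Regex("\\s+"))
  require(fields.size == 5) { "expected 5 fields: minute hour day month weekday" }
  val (minute, hour, dayOfMonth, month, weekday) = fields
  // Unix cron counts weekdays 0-6 from Sunday; java.time counts 1-7 from Monday.
  val cronWeekday = t.dayOfWeek.value % 7
  // Simplification: all five fields must match (real cron has extra day-of-month/weekday rules).
  return fieldMatches(minute, t.minute) &&
    fieldMatches(hour, t.hour) &&
    fieldMatches(dayOfMonth, t.dayOfMonth, min = 1) &&
    fieldMatches(month, t.monthValue, min = 1) &&
    fieldMatches(weekday, cronWeekday)
}

fun main() {
  val monday = LocalDateTime.of(2024, 1, 15, 9, 30) // 2024-01-15 is a Monday
  val sunday = LocalDateTime.of(2024, 1, 14, 9, 30)
  println(cronMatches("30 9 * * 1-5", monday)) // true
  println(cronMatches("30 9 * * 1-5", sunday)) // false
  println(cronMatches("*/15 * * * *", monday)) // true
}
```

Note that there is nowhere to put a seconds value: the smallest unit a pattern can name is the minute.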

Wiring it

Two installs and you’re done: a CronModule (once, configuring the cluster-wide behavior) and a CronEntryModule per job (registering each Runnable). The exemplar uses the fake variant for local dev, but the production shape is identical:

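import java.time.ZoneId
import misk.cron.CronEntryModule
import misk.cron.CronModule
import misk.inject.KAbstractModule

class MyAppCronModule : KAbstractModule() {
  override fun configure() {
    install(CronModule(zoneId = ZoneId.of("America/New_York")))
    // Register the job; its schedule comes from the @CronPattern annotation.
    install(CronEntryModule.create<ExpireSessionsCron>())
  }
}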

The ZoneId is not decoration — cron expressions are evaluated in that zone, so "0 2 * * *" means 2 a.m. New York time, daylight-saving shifts and all. Pin it deliberately; the wrong zone is how “the nightly job” quietly becomes “the 9 p.m. job” for half the year.
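The effect of the zone is easy to check with plain java.time, independent of misk. The sketch below uses a hypothetical helper, wallClockIn, to show where a 02:00 schedule lands when it is pinned to New York versus pinned to UTC; the dates are arbitrary winter and summer examples.

```kotlin
import java.time.LocalDate
import java.time.LocalTime
import java.time.ZoneId
import java.time.ZoneOffset
import java.time.ZonedDateTime

// Hypothetical helper: takes a wall-clock time in one zone and shows it in another.
fun wallClockIn(date: LocalDate, time: LocalTime, from: ZoneId, to: ZoneId): ZonedDateTime =
  ZonedDateTime.of(date, time, from).withZoneSameInstant(to)

fun main() {
  val newYork = ZoneId.of("America/New_York")
  val utc: ZoneId = ZoneOffset.UTC
  val twoAm = LocalTime.of(2, 0)
  val winter = LocalDate.of(2024, 1, 15)
  val summer = LocalDate.of(2024, 7, 15)

  // "0 2 * * *" pinned to New York: same local time, different UTC time by season.
  println(wallClockIn(winter, twoAm, newYork, utc).toLocalTime()) // 07:00
  println(wallClockIn(summer, twoAm, newYork, utc).toLocalTime()) // 06:00

  // "0 2 * * *" pinned to UTC: lands on the previous evening in New York.
  println(wallClockIn(winter, twoAm, utc, newYork).toLocalTime()) // 21:00
  println(wallClockIn(summer, twoAm, utc, newYork).toLocalTime()) // 22:00
}
```

Neither choice is wrong in itself; the point is that the pattern string alone does not say which of these four times you get.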

CronEntryModule.create<T>() multibinds your runnable into the set of known cron entries. Its signature confirms the optional pattern override:

inline fun <reified A : Runnable> create(
  cronPattern: CronPattern? = null,
): CronEntryModule<A>

Pass null (the default) and the framework reads @CronPattern off the class. Pass one explicitly and it wins — useful for tests, or for schedules you’d rather configure than annotate.

A few real CronModule parameters worth knowing, straight from its constructor:

class CronModule @JvmOverloads constructor(
  private val zoneId: ZoneId,
  private val threadPoolSize: Int = 10,
  private val dependencies: List<Key<out Service>> = listOf(),
  private val installDashboardTab: Boolean = true,
  private val useMultipleLeases: Boolean = false,
)

threadPoolSize (default 10) bounds how many jobs run concurrently. Jobs execute on a fixed thread pool, so a slow job can starve others if you under-size it. dependencies lets cron wait for other services to be ready before it starts firing (the exemplar makes its cron depend on a DependentService, so jobs never run against a half-initialized app). installDashboardTab wires a Misk admin-dashboard tab listing your crons and letting you trigger them by hand. And useMultipleLeases — the most consequential flag here — controls cluster coordination, which is the whole ballgame.
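The starvation risk is a property of any fixed-size pool, so it can be shown without misk at all. The sketch below uses a plain java.util.concurrent pool as a stand-in (an assumption; it is not the executor misk constructs) and a hypothetical starvationDemo helper: one job blocks, and a quick job submitted afterwards either runs or waits depending on the pool size.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Returns (quick job ran while the slow job was still blocked, quick job ran after it finished).
fun starvationDemo(poolSize: Int): Pair<Boolean, Boolean> {
  val pool = Executors.newFixedThreadPool(poolSize)
  val slowStarted = CountDownLatch(1)
  val releaseSlow = CountDownLatch(1)
  val quickDone = CountDownLatch(1)
  try {
    // The "slow job": occupies a worker thread until we release it.
    pool.execute {
      slowStarted.countDown()
      releaseSlow.await()
    }
    slowStarted.await()

    // The "quick job": needs a free worker thread to run at all.
    pool.execute { quickDone.countDown() }

    val ranWhileBlocked = quickDone.await(500, TimeUnit.MILLISECONDS)
    releaseSlow.countDown()
    val ranAfterwards = quickDone.await(5, TimeUnit.SECONDS)
    return ranWhileBlocked to ranAfterwards
  } finally {
    releaseSlow.countDown()
    pool.shutdown()
  }
}

fun main() {
  println(starvationDemo(poolSize = 1)) // (false, true): the quick job waited behind the slow one
  println(starvationDemo(poolSize = 2)) // (true, true): a second thread picked it up
}
```

With the default of 10 threads you need ten simultaneously blocked jobs to hit this, which is unlikely for short housekeeping tasks and quite possible for long sweeps.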

How scheduling coordinates across a cluster

Here’s the part that earns the module its keep. A background task — CronTask, a RepeatedTaskQueue job — wakes up on a fixed interval and asks the CronManager to run anything that’s due:

val INTERVAL: Duration = Duration.ofSeconds(60L)

Every 60 seconds it computes which entries are due since the last tick. This is why resolution is one minute — there’s no point in a seconds field when the poll interval is a minute. But “due” is only half the gate. The actual decision in CronManager.runReadyCrons is:

if (nextExecutionTime.toInstant() <= now && cronCoordinator.shouldRunTask(cronEntry.name)) {
  runCron(cronEntry)
}

That second condition — cronCoordinator.shouldRunTask(...) — is the cluster-safety mechanism. Every replica runs this exact loop on its own 60-second timer; what stops them from all firing is that they each have to ask the coordinator for permission, and the coordinator is backed by a lease. The default coordinator:

class SingleLeaseCronCoordinator @Inject constructor(
  private val leaseManager: LeaseManager,
) : CronCoordinator {
  override fun shouldRunTask(taskName: String): Boolean {
    val lease = leaseManager.requestLease(CRON_CLUSTER_LEASE_NAME)  // "misk.cron.lease"
    return lease.checkHeld() || lease.acquire()
  }
}

One lease, "misk.cron.lease", covers the entire service. Exactly one pod holds it at a time; that pod answers true and runs the job, while every other pod’s acquire() fails and it skips. So in the default single-lease mode, one elected pod runs all your crons. It is a simple operational model with a single point of execution, and a job cannot double-fire as long as the lease has a single holder. The lease handles the election for you. This is the same lease machinery that drives leader election generally, which is the subject of the next post.
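A small in-memory simulation makes the single-lease behavior visible. Everything here is a hypothetical stand-in: LeaseStore, PodLease, and SimulatedPod are invented for the sketch, and a real lease lives in shared storage and can expire, which this toy version ignores. Only the lease name and the checkHeld() || acquire() shape come from the coordinator quoted above.

```kotlin
// Hypothetical in-memory lease store shared by all simulated pods.
class LeaseStore {
  private val holders = mutableMapOf<String, String>() // lease name -> pod name

  fun holder(lease: String): String? = holders[lease]

  // First caller wins; later callers succeed only if they are already the holder.
  fun tryAcquire(lease: String, pod: String): Boolean {
    if (holders[lease] == null) holders[lease] = pod
    return holders[lease] == pod
  }
}

// Shaped after the two calls the coordinator makes on a lease.
class PodLease(private val store: LeaseStore, private val name: String, private val pod: String) {
  fun checkHeld(): Boolean = store.holder(name) == pod
  fun acquire(): Boolean = store.tryAcquire(name, pod)
}

class SimulatedPod(val name: String, store: LeaseStore) {
  private val lease = PodLease(store, "misk.cron.lease", name)
  val ran = mutableListOf<String>()

  // One poll: every due task is gated on the same cluster-wide lease.
  fun tick(dueTasks: List<String>) {
    for (task in dueTasks) {
      if (lease.checkHeld() || lease.acquire()) ran += task
    }
  }
}

fun simulateSingleLease(podNames: List<String>, dueTasks: List<String>): Map<String, List<String>> {
  val store = LeaseStore()
  val pods = podNames.map { SimulatedPod(it, store) }
  pods.forEach { it.tick(dueTasks) } // every replica runs the same loop
  return pods.associate { it.name to it.ran.toList() }
}

fun main() {
  println(simulateSingleLease(listOf("pod-a", "pod-b", "pod-c"), listOf("expire-sessions", "rollup-metrics")))
  // {pod-a=[expire-sessions, rollup-metrics], pod-b=[], pod-c=[]}
}
```

All three pods evaluate the same due list, but only the lease holder records any work.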

Set useMultipleLeases = true and the coordinator changes to one lease per task:

class MultipleLeaseCronCoordinator @Inject constructor(
  private val leaseManager: LeaseManager,
) : CronCoordinator {
  override fun shouldRunTask(taskName: String): Boolean {
    val taskLease = leaseManager.requestLease("misk.cron.task.$taskName")
    return taskLease.checkHeld() || taskLease.acquire()
  }
}

Now different pods can hold different jobs’ leases, so your crons spread across the cluster instead of piling onto one elected pod. That’s better resource utilization and fault tolerance. The catch is the switchover: during a rolling deploy, pods still on the cluster-wide lease and pods already on per-task leases are not competing for the same lease, so the same job can run on both at once. That is why the README insists tasks be idempotent before you flip the flag. Pick single-lease when idempotency is uncertain or you have a handful of jobs; pick multiple-lease when you have many independent, idempotent jobs and want to fan them out.
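The difference between the two modes, and the hazard of mixing them, can be sketched with a toy lease table. The types below (LeaseMode, Pod, LeaseTable, runTick) are hypothetical, and the per-task rotation of pod order is just a device to mimic different pods winning different races; only the two lease name formats are taken from the coordinators above.

```kotlin
enum class LeaseMode { SINGLE, PER_TASK }

data class Pod(val name: String, val mode: LeaseMode)

// Hypothetical in-memory lease table: first pod to ask for a lease keeps it.
class LeaseTable {
  private val holders = mutableMapOf<String, String>()
  fun acquire(lease: String, pod: String): Boolean = holders.getOrPut(lease) { pod } == pod
}

fun leaseNameFor(mode: LeaseMode, task: String): String = when (mode) {
  LeaseMode.SINGLE -> "misk.cron.lease"
  LeaseMode.PER_TASK -> "misk.cron.task.$task"
}

// Returns task -> pods that ran it during one poll.
fun runTick(pods: List<Pod>, tasks: List<String>): Map<String, List<String>> {
  require(pods.isNotEmpty())
  val table = LeaseTable()
  return tasks.withIndex().associate { (i, task) ->
    // Rotate the arrival order so a different pod reaches each task's lease first.
    val shift = i % pods.size
    val arrival = pods.drop(shift) + pods.take(shift)
    val ranOn = mutableListOf<String>()
    for (pod in arrival) {
      if (table.acquire(leaseNameFor(pod.mode, task), pod.name)) ranOn += pod.name
    }
    task to ranOn.toList()
  }
}

fun fleet(a: LeaseMode, b: LeaseMode, c: LeaseMode) =
  listOf(Pod("pod-a", a), Pod("pod-b", b), Pod("pod-c", c))

fun main() {
  val tasks = listOf("expire-sessions", "rollup-metrics", "sweep-dlq")
  println(runTick(fleet(LeaseMode.SINGLE, LeaseMode.SINGLE, LeaseMode.SINGLE), tasks))
  // {expire-sessions=[pod-a], rollup-metrics=[pod-a], sweep-dlq=[pod-a]}
  println(runTick(fleet(LeaseMode.PER_TASK, LeaseMode.PER_TASK, LeaseMode.PER_TASK), tasks))
  // {expire-sessions=[pod-a], rollup-metrics=[pod-b], sweep-dlq=[pod-c]}
  println(runTick(fleet(LeaseMode.SINGLE, LeaseMode.PER_TASK, LeaseMode.PER_TASK), tasks))
  // {expire-sessions=[pod-a, pod-b], rollup-metrics=[pod-b, pod-a], sweep-dlq=[pod-c, pod-a]}
}
```

The last line is the mid-deploy picture: the old pod holds the cluster-wide lease, the new pods hold per-task leases, and every task runs twice.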

Production notes & gotchas

It can drop tasks. Design for that. This is not at-least-once. A pod can die mid-run() holding the lease and that execution is simply lost — no retry, no resurrection. Make jobs resumable (next run picks up where the last left off) rather than assuming every tick fires.
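Pin the ZoneId on purpose. Schedules evaluate in the configured zone, DST included. A “daily at 2 a.m.” job in a DST-observing zone moves by an hour against UTC twice a year, and the same job pinned to UTC moves by an hour against local wall-clock time instead, so decide which of the two the job should track. Keep in mind that “midnight UTC” is a different local time for everyone reading the logs.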
Mind the thread pool. Jobs share a fixed pool (default 10). A job that blocks for minutes — the exemplar’s demo job deliberately sleeps 60 seconds — occupies a thread the whole time. Long or overlapping jobs can starve the pool; size it, or move heavy work off the cron thread.
One-minute resolution, polled every 60s. There is no sub-minute scheduling, and a due job can be delayed by up to a full poll interval. Don’t build anything that needs second-level precision on this.
Switching useMultipleLeases is a deploy-time hazard. The README is blunt: during a rolling deploy with old pods on cluster-wide leases and new pods on per-task leases, the same task can run on both. Migrate during a downtime window or only after you’ve made the task idempotent — it is not a runtime toggle.
Exceptions are swallowed, not propagated. runCron wraps your run() in try/catch and only logs on failure. A throwing job doesn’t crash the scheduler (good), but it also doesn’t retry or surface anywhere except the logs. Wire your own alerting: emit a metric on failure rather than assuming someone reads the log.
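Two of the notes above, resumable jobs and failure visibility, fit in one short sketch. All names here are hypothetical: FailureCounter stands in for whatever metrics client your service uses, and the watermark is kept in memory only to keep the example self-contained, where a real job would persist it.

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Hypothetical metric sink; swap in your service's real metrics client.
class FailureCounter {
  private val failures = AtomicLong()
  fun increment() { failures.incrementAndGet() }
  fun count(): Long = failures.get()
}

// Counts failures, then rethrows so the scheduler's own logging still sees the exception.
class CountingRunnable(
  private val delegate: Runnable,
  private val failures: FailureCounter,
) : Runnable {
  override fun run() {
    try {
      delegate.run()
    } catch (e: Exception) {
      failures.increment()
      throw e
    }
  }
}

// Resumable sweep: each run handles everything past the watermark,
// so a missed tick is caught up by the next run rather than lost.
class ResumableSweep(private val source: () -> List<Long>) : Runnable {
  var watermark = 0L
    private set
  val handled = mutableListOf<Long>()

  override fun run() {
    for (id in source().filter { it > watermark }.sorted()) {
      handled += id
      watermark = id // advance only after the item is handled
    }
  }
}

fun main() {
  val ids = mutableListOf(1L, 2L)
  val sweep = ResumableSweep { ids.toList() }
  sweep.run()                // handles 1 and 2
  ids += listOf(3L, 4L, 5L)  // work piles up across a missed tick
  sweep.run()                // the next run catches up
  println(sweep.handled)     // [1, 2, 3, 4, 5]

  val failures = FailureCounter()
  val flaky = CountingRunnable(Runnable { error("boom") }, failures)
  runCatching { flaky.run() } // stands in for the scheduler's try/catch
  println(failures.count())   // 1
}
```

The wrapper is deliberately thin: it changes nothing about scheduling and only makes a failure countable, which is what an alert can be built on.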

What’s next

Everything load-bearing about misk cron — the single-pod election, the per-task fan-out, the “exactly one holder at a time” guarantee — is really one primitive wearing a cron costume: the lease. In Part 20: Misk Distributed Leases & Leader Election we’ll pull that LeaseManager apart, see how a lease is acquired and held, what happens when a holder dies, and how to use leases directly for any “only one replica should do this” problem — not just scheduled jobs.
