A Small Issue With SMTP

SMTP, the Simple Mail Transfer Protocol, is the communications protocol that email servers (Mail Transfer Agents, or MTAs) use to talk to one another to… transfer mail. And SMTP has one glaring issue: a rather ambiguous “try again later”.

SMTP is a lock-step protocol: the client sends one command and the server sends one response, and you cannot send another command until you’ve received and understood that first response.1 Responses come in a few forms: usually 2xx or 3xx for “positive” or “please continue” responses, and 4xx and 5xx for errors. A 4xx response is temporary; it means “please try again later.” A 5xx response is permanent; it means “don’t retry, send an error to the sender and discard the message.” A 4xx response can happen for many reasons:

  • Recipient mailbox full
  • Recipient is being rate-limited
  • Security checks had an internal issue (like DNS error), please try again later
  • Couldn’t store the file for some reason
  • Mail server is experiencing heavy load

All of which make sense. Every good MTA has some exponential back-off for this: you get a 4xx code and wait a minute, then try again, then wait two, then four, then eight… until either you’ve hit a maximum number of attempts and send a non-delivery notice back to the sender (called an email bounce), or the message finally goes through. The fundamental issue is that the sender has to guess how long to wait, and in my opinion that issue mostly exists because of a thing called greylisting.
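The retry loop described above can be sketched roughly like this. It is a minimal illustration, not how any particular MTA does it: the names `classify_reply`, `backoff_delays`, and `deliver` are made up for this post, the one-minute start and doubling factor are just the numbers from the paragraph above, and `send` stands in for "run the whole SMTP conversation and hand back the final reply line".

```python
import time
from typing import Callable, Iterable, Iterator


def classify_reply(reply: str) -> str:
    """Map an SMTP reply line to a coarse outcome using its first digit."""
    first_digit = reply[:1]
    if first_digit in ("2", "3"):
        return "ok"      # positive / "please continue" (simplified to success here)
    if first_digit == "4":
        return "retry"   # temporary failure: try again later
    if first_digit == "5":
        return "bounce"  # permanent failure: tell the sender and discard
    raise ValueError(f"not an SMTP reply: {reply!r}")


def backoff_delays(first_delay: float = 60.0, max_retries: int = 5) -> Iterator[float]:
    """Yield the wait before each retry, doubling every time (assumed policy)."""
    delay = first_delay
    for _ in range(max_retries):
        yield delay
        delay *= 2


def deliver(send: Callable[[], str], delays: Iterable[float],
            sleep: Callable[[float], None] = time.sleep) -> str:
    """Try once, then retry on 4xx until the delays run out."""
    outcome = classify_reply(send())
    for delay in delays:
        if outcome != "retry":
            return outcome
        sleep(delay)
        outcome = classify_reply(send())
    # Still a temporary failure after the last retry: give up and bounce.
    return "bounce" if outcome == "retry" else outcome
```

Passing `sleep` in as a parameter is only there so the loop can be exercised without actually waiting; a real queue manager would schedule the retry rather than block.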

Greylisting (or, if you spell it differently, graylisting) is a security measure used by some services, which works like this:

  1. When receiving an email from somewhere, check if they’ve sent you a good message before. If so, take no action.
  2. Otherwise, reject with a temporary failure, and note the time you received the message.
  3. When you receive the same message again, if it’s after a certain delay, mark the message and sender as “good” and let the message through.
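Those three steps fit in a few lines of code. The sketch below is only an illustration under some assumptions of mine: the `Greylist` class is hypothetical, "the same message" is identified by the (sending IP, envelope sender, envelope recipient) triplet, and the 300-second minimum delay is a placeholder. Real implementations also expire old entries, which is left out here.

```python
class Greylist:
    """Toy greylist keyed on (sender IP, MAIL FROM, RCPT TO)."""

    def __init__(self, min_delay: float = 300.0):
        self.min_delay = min_delay
        self.first_seen = {}     # triplet -> time of the first rejected attempt
        self.known_good = set()  # triplets that have already passed once

    def check(self, sender_ip: str, mail_from: str, rcpt_to: str, now: float) -> str:
        key = (sender_ip, mail_from, rcpt_to)

        # Step 1: this sender has delivered successfully before, take no action.
        if key in self.known_good:
            return "accept"

        # Step 2: remember when we first saw this triplet.
        first = self.first_seen.setdefault(key, now)

        # Step 3: a retry after the minimum delay marks the sender as good.
        if now - first >= self.min_delay:
            self.known_good.add(key)
            del self.first_seen[key]
            return "accept"

        # First attempt, or a retry that came too soon: 4xx, try again later.
        return "tempfail"
```

For example, a sender that shows up at t=0 and t=30 is turned away both times, and is only accepted once it retries at t=300 or later.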

This is a pretty good anti-spam measure, since most spam systems expect a lot of delivery errors and don’t bother to hold and retry the temporary ones. By rejecting their first (and only) attempt, you’ve stopped that spam message, and you could get more advanced by checking which hosts never retried and adding them to a spam list. The “had to wait X amount of time” stipulation is there to prevent immediate retries, which don’t really count in this instance. A typical default for that delay is 5 minutes (300 seconds). Note that the 4xx response is allowed to contain arbitrary text, but the sending MTA doesn’t parse it; if anything, it’s either logged or sent to the user in a bounce message. So the sender never learns how long it is expected to wait. This is the issue.

Starting backoff times vary wildly from place to place. Some use times of around 5 minutes, some closer to one, and Twitter is an outlier that will wait one entire hour after a failed delivery. Note that I said some are closer to one minute; several are even shorter, like 30 seconds. These also tend to be the ones that stop delivery after 3 temporary failures.
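To see why the short starting times hurt, here is the arithmetic as a small sketch. It assumes the delay doubles after every failure and that “3 temporary failures” means three attempts in total; both are my guesses rather than anything these senders document, and the function names are made up.

```python
from itertools import accumulate


def attempt_times(first_delay: float, attempts: int) -> list:
    """Seconds after the first attempt at which each attempt happens."""
    delays = [first_delay * 2 ** i for i in range(attempts - 1)]
    return [0] + list(accumulate(delays))


def survives_greylisting(first_delay: float, attempts: int,
                         greylist_delay: float = 300) -> bool:
    """True if at least one attempt lands after the greylisting delay."""
    return any(t >= greylist_delay for t in attempt_times(first_delay, attempts))


print(attempt_times(30, 3))          # [0, 30, 90]
print(survives_greylisting(30, 3))   # False
print(survives_greylisting(300, 3))  # True
```

Under those assumptions, a sender that starts at 30 seconds and gives up after three tries is done within 90 seconds, well short of a 300-second greylisting delay, so the message bounces even though nothing was actually wrong.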

A Proposed Solution

What SMTP needs is some way for the receiving MTA to tell the sending MTA, in the 4xx response, how long the temporary error is expected to last, if it knows. For example, instead of the error message:

450 4.2.0 <[email protected]>: Recipient address rejected: Greylisted; from=<[email protected]> to=<[email protected]> proto=ESMTPS helo=<example.com>

Make it:

450 4.2.0 TIME=300 <[email protected]>: Recipient address rejected: Greylisted; from=<[email protected]> to=<[email protected]> proto=ESMTPS helo=<example.com>

Add some keyword in there that tells the sending MTA exactly how long to wait, at minimum, before trying again. If the reason was, say, that a user has received too many messages today and you should try again tomorrow, you could specify a time of 86,400 seconds, or 24 hours. And with a format like the one I just described, any MTA that doesn’t support this would just see it as part of the error message, while ones that do could parse it out to mean “retry no earlier than 300 seconds from now.”
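On the sending side, honoring the hint could be as small as the sketch below. To be clear, `TIME=` is only the keyword proposed in this post and is not part of SMTP today; `retry_after` is a hypothetical helper, and the address in the usage example is a placeholder.

```python
import re

# Matches the proposed "TIME=<seconds>" keyword anywhere in the reply text.
_TIME_HINT = re.compile(r"\bTIME=(\d+)\b")


def retry_after(reply: str, own_delay: float) -> float:
    """Return how long to wait before the next attempt.

    The hint is treated as a minimum: we wait for whichever is longer,
    our own back-off delay or the time the receiver asked for.
    """
    match = _TIME_HINT.search(reply)
    if match is None:
        return own_delay  # no hint, behave exactly as before
    return max(own_delay, int(match.group(1)))
```

So a sender whose next back-off step is 60 seconds would wait 300 after seeing `450 4.2.0 TIME=300 <user@example.com>: Recipient address rejected: Greylisted`, while a reply without the keyword leaves the 60 seconds untouched.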

Being 100% honest, it’s just, well, a small issue, nothing too important. While it would be cool to find enough reason to consider pushing an RFC for this very thing, well, I really doubt that it’s enough of an issue to really warrant anything more than one person, with a blog, complaining on the internet.

  1. The PIPELINING extension is a thing in ESMTP, but for now that’s irrelevant to the point at hand.