|
ABSTRACT
The term "spam", in the context of electronic communication, is typically
understood to mean:
Spam: An unsolicited message sent to a large number of recipients.
|
It is commonly believed that a spam message for a given product or service
sent to thousands or millions of people will generate very few actual customers
for the sender of the message. Even if the ratio of attracted customers to
total spam message recipients is only 1/10000, or even much lower, the overhead
cost of spamming is so low that a net profit is possible. In fact, for many
of the spam products, selling a single product, or scamming a single victim,
may be the break-even point for the business model.
Several factors contribute to the perception of the spam phenomenon as a
pressing social and technological problem:
(1) Spam wastes millions of hours of humanity's time each day
on the task of differentiating between spam messages and
"legitimate" messages;
(2) Spam consumes a significant fraction of total Internet
bandwidth, which slows other traffic and possibly raises
overall bandwidth cost;
(3) Spam consumes a large amount of storage space on mail
servers, sometimes making it temporarily impossible for
"legitimate" messages to be received;
(4) Spam can be the vehicle for "identity theft" campaigns, other
types of fraud, or virus propagation.
|
This article describes the flaws in many of the simplistic methods proposed
or employed in past attempts to mitigate the problems listed above.
A more robust solution is proposed.
|
The Original SPAM!
SPAM is a pork and ham food product produced by Hormel Foods
(http://www.hormel.com):
"SPAM Classic is a conveniently packaged canned meat product made of
100 percent pure pork and ham. SPAM Classic contains 180 calories per
two-ounce serving. SPAM luncheon meat first was produced in 1937.
It was one of the first convenient, moderately priced and great
tasting meat products on the market."
|
Hormel Foods has written an article commenting on the use of the term "spam" to
refer to unsolicited e-mail: http://www.spam.com/ci/ci_in.htm
From the article:
Ultimately, we are trying to avoid the day when the consuming public
asks, "Why would Hormel Foods name its product after junk e-mail?"
|
Briefly, the use of the word "spam" to mean "driving insane with relentless,
monotonous bombardment" is directly attributed to a "Monty Python's Flying
Circus" (humorous BBC television series) skit that essentially celebrates SPAM.
A restaurant patron discovers, with chagrin, that everything on the menu
contains SPAM -- such as: "[...] spam spam spam egg and spam; spam spam spam
spam spam spam baked beans spam spam spam...". The mention of spam rouses
Viking restaurant patrons to break into song: "Spam, spam, spam, spam.
Lovely SPAM, wonderful SPAM!", driving the frustrated restaurant patron mad.
A detailed article on the "Origin of the term 'spam' to mean [network] abuse"
was written by Brad Templeton: http://www.templetons.com/brad/spamterm.html
|
REASON FOR THIS ARTICLE
I am very concerned by various "solutions" to the spam phenomenon that involve:
(1) Invasion of Privacy;
(2) Censorship;
(3) Payment and Cooperation with a Commercial Entity.
|
The very day I started writing this article (2004 March 29th) I heard a report
on the BBC World Service (rebroadcast on a local public radio station)
featuring an interview with a person affiliated with a company that will offer
a "new kind of spam filtering" on a paid membership basis. The method relies
on monitoring Internet traffic, searching for "identical" e-mail messages
coming from a common source. Suspected spam e-mail is analyzed further to
discover any links to sites previously associated with spam efforts. This
service will fail due to various scenarios described in this article. However,
my biggest concern when hearing proposals like this is that the public will
embrace the proposed solution without fully considering issues of invasion of
privacy, censorship, and corporate interests.
Clients of various online services are subject to the contracts of the service
providers. I have no complaint with that because I can choose to avoid
service providers with terms I do not like. My concern is that the current
conversations about solutions to the spam phenomenon will lead to a wide
acceptance of terms that go against principles I consider important -- and I
believe a significant fraction of the people who would willingly accept such
terms might not be so accepting if the impact of such terms on privacy,
freedom of communication, and freedom from the influence of corporate
interests, are described in a way that makes the issues very relevant and
personal.
|
GALLERY OF SPAM
This section presents contemporary examples of spam, with some analysis and
related information. Although this section is based on spam I have personally
received, I believe my experience is typical of users of e-mail.
This section is intended to sketch the basic principles of spam. An attempt at
a formal definition of the term "spam" will be postponed until the next section.
Presenting examples in this section will make subsequent formal discussion less
abstract.
|
1. My E-mail "Inbox"
Over the past few months I have received an average of roughly 100 spam messages
per day, and I generally receive several viruses as e-mail attachments each day.
Earlier this year, from 2004 January 15th through 2004 February 8th (25 days),
I received 2872 spam messages, of which 207 carried viruses. That corresponds
to an average of about 115 spam messages per day, and an average of about 8
virus attachments per day.
|
FIGURE: A portion of my e-mail "Inbox" on 2004 March 29th as manifested by the
"Microsoft Outlook Express 5" application. On this date I received
9 "legitimate" messages, 77 spam messages, and 2 virus attachments.
SENDER NAME AND SUBJECT:
========================
One of the striking features of most spam messages is that the disingenuousness
starts almost immediately with the alleged sender's name. The fact that almost
every spam message has a fake sender name cheapens the whole concept of the
sender name. Of course that is just the beginning of the erosion of trust, but
I nonetheless pause and consider the bizarre act of a spammer producing a fake
sender name. Spam messages promoting "male sexual performance" drugs and
pornography often have sender names that are female.
Interestingly, the subject associated with a spam message often really does
contain an accurate summary of the spam message. But, as one can see in the
small set of subject items above, some spammers believe it is possible to do
without sensible descriptions of e-mail content.
Eventually both the sender name and subject line will be recognized by the
public at large as totally meaningless claims associated with the messages,
which is a reflection of the actual technical fact: these fields are totally
unreliable for determining the origin and content of e-mail messages.
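To make the point concrete, here is a minimal sketch in Python, using only the standard "email" package. The names and addresses are made up for illustration. It shows that the sender name and subject are simply text supplied by whoever composes the message; the receiving side can read these fields but has no means, from the message alone, to verify them.

```python
from email import message_from_string
from email.message import EmailMessage


def build_message(sender, subject, body):
    """Build a message in which every header is chosen by the sender."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = "recipient@example.com"  # hypothetical address
    msg["Subject"] = subject
    msg.set_content(body)
    return msg.as_string()


# A hypothetical spammer claims whatever identity is convenient.
raw = build_message("Jennifer Smith <jennifer@example.org>",
                    "Re: your order",
                    "Buy now!")

# The recipient's software can parse the claims, but not check them.
parsed = message_from_string(raw)
claimed_sender = parsed["From"]
claimed_subject = parsed["Subject"]
print(claimed_sender, "|", claimed_subject)
```

Nothing in this sketch contacts a mail server; it only shows that the header fields carry no built-in proof of origin.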
|
WHAT IS ALL THE SPAM ABOUT?
===========================
I did a simple analysis of the spam messages received on three recent dates:
(1) 2004 March 29th : 77 spam messages total;
(2) 2004 March 30th : 98 spam messages total;
(3) 2004 March 31st : 121 spam messages total;
|
The following is a rough classification of the messages received:
MEDICATION:
-------------------------------------------------------------------
PENIS-ENLARGEMENT:
  Viagra, Cialis, NaturalGain,
  "Weekend Pill", Viagra Patch:       18/77,  17/98,  16/121
ALTERNATIVE-SOURCE PRESCRIPTION
MEDICATIONS/PSYCHOTROPIC DRUGS:
  Levitra, Phentermine, Vicodin,
  Valium, Ambien, Xanax, Tramadol,
  Lipitor, Propecia, Zocor:           14/77,  18/98,  19/121
  Marijuana-like product/
  Mood Enhancers/Herbal Meds:          1/77,   0/98,   0/121
DIET/NUTRITION:
  Diet Pills/Patch:                    3/77,   3/98,   3/121
  Anti-Aging/HGH:                      1/77,   0/98,   1/121
SMOKING:
  Cigarettes:                          1/77,   1/98,   3/121
HEALTH AID:
  Snoring Control:                     1/77,   0/98,   0/121
-------------------------------------------------------------------
TOTAL:                   39/77(50%), 39/98(40%), 42/121(35%)
|
FINANCIAL:
-------------------------------------------------------------------
LOANS/CREDIT:
  Refinance Mortgage/Equity Loan:     13/77,  12/98,  11/121
  "Cancel Debt" (somehow):             0/77,   1/98,   8/121
  Car Loans:                           0/77,   2/98,   1/121
  Payday Cash Advance:                 1/77,   1/98,   0/121
  Unsecured MasterCard/Credit:         1/77,   0/98,   1/121
INVESTING:
  Investor/Stock Alert:                5/77,   5/98,   3/121
INSURANCE:
  Life Insurance:                      1/77,   1/98,   2/121
  Healthcare:                          1/77,   0/98,   0/121
  Auto/Warranties:                     1/77,   0/98,   0/121
BUSINESS OPPORTUNITIES:
  "Work" on eBay:                      1/77,   6/98,   4/121
  Own Resort:                          1/77,   0/98,   0/121
  "Network Marketing":                 0/77,   0/98,   1/121
  Real-Estate Auctions:                0/77,   0/98,   1/121
GAMBLING:
  Poker/"Earn Money Playing Lotto!":   0/77,   1/98,   2/121
SPAMMING:
  Spam 27 million people:              0/77,   1/98,   0/121
-------------------------------------------------------------------
TOTAL:                   25/77(32%), 30/98(31%), 34/121(28%)
|
SOFTWARE/CONTENT:
-------------------------------------------------------------------
PORNOGRAPHY:
  Porn (farm sex, schoolgirls,
  girls gushing, web cam,
  monster cocks):                      1/77,   1/98,   6/121
PARANOIA/SNOOPING:
  Software to Learn about People:      1/77,   0/98,   0/121
  Scan PC:                             1/77,   0/98,   0/121
  Keyboard Logger:                     0/77,   1/98,   0/121
PIRACY:
  Cheap software/OS:                   2/77,   8/98,   5/121
  DVD copying:                         0/77,   2/98,   0/121
  Cable Descrambling/
  Free "Pay-Per-View"(!):              0/77,   2/98,   0/121
-------------------------------------------------------------------
TOTAL:                      5/77(6%), 14/98(14%), 11/121(9%)
|
MALICIOUS/FRAUD:
-------------------------------------------------------------------
VIRUS:
  Virus (Mail "Delivery Failed" type
  with attachment):                    2/77,   0/98,   1/121
IDENTITY THEFT:
  Web-based "verification"
  (PayPal,eBay,Fleet Bank):            2/77,   2/98,   0/121
-------------------------------------------------------------------
TOTAL:                         4/77(5%), 2/98(2%), 1/121(1%)
|
MISCELLANEOUS:
-------------------------------------------------------------------
Unknown:                               2/77,   6/98,  18/121
Blind date/dating:                     0/77,   0/98,   5/121
Earn Degree/Degree without Tests:      0/77,   1/98,   3/121
"Colin, Grow 2 Cup Sizes -- FREE!",
Bigger Breast From Pill:               0/77,   1/98,   2/121
Vacation Deals:                        1/77,   1/98,   0/121
Your Opinions might make you 1000:     0/77,   1/98,   1/121
Hair Transplants:                      0/77,   1/98,   1/121
Misc. Deals:                           1/77,   0/98,   0/121
Luxury Sheets:                         0/77,   1/98,   0/121
Free Samsung Mobile Phone:             0/77,   1/98,   0/121
Hypnotic MP3 for Depression,
Self-Esteem, Motivation:               0/77,   0/98,   1/121
Wristwatches (Rolex,etc):              0/77,   0/98,   1/121
Print Own Postage:                     0/77,   0/98,   1/121
-------------------------------------------------------------------
TOTAL:                     4/77(5%), 13/98(13%), 33/121(27%)
|
SUMMARY:
-----------------------------------------------------------------------
MEDICATION TOTAL:        39/77( 50% ),  39/98( 40% ),   42/121( 35% )
FINANCIAL TOTAL:         25/77( 32% ),  30/98( 31% ),   34/121( 28% )
SOFTWARE/CONTENT TOTAL:   5/77(  6% ),  14/98( 14% ),   11/121(  9% )
MALICIOUS/FRAUD TOTAL:    4/77(  5% ),   2/98(  2% ),    1/121(  1% )
MISCELLANEOUS TOTAL:      4/77(  5% ),  13/98( 13% ),   33/121( 27% )
-----------------------------------------------------------------------
TOTAL:                   77/77(100%*),  98/98(100%*),  121/121(100%*)
(*...Percentages in this table are rounded and do not
add to 100% with shown precision.)
|
ANALYSIS:
=========
Medication is the most frequent topic of spam messages during this three-day
sample. Two types of medication supply services dominate in this category of
spam messages: (1) Penis enlargement; (2) General pharmacy "needs" (often drugs
that are expensive in the domestic US market, and drugs which reputable doctors
may be hesitant to prescribe due to lack of medical justification and potential
for abuse). Spam messages promoting penis-enlargement drugs are typically very informal,
using phrases like: "Haha, U Have A Real Small Pe-nis", "Is Your Me.mber too
Teeny?", "Screw ur lover like never before", etc.
Financial topics were very common among the spam messages during this three-day
sample. Home mortgage loans and refinancing offers dominate this category of
spam messages. Investor "stock alerts" are also common. During this period,
the "making a fortune on eBay" plan was significantly promoted. My personal
favorite scam concept in this category arrived with the subject: "Earn Money
Playing Lotto!".
Software and media content are popular spam topics. Offers of inexpensive
software dominate this category; there is no doubt that this software is
pirated, despite explanations of how, for example, one can buy Windows XP for
$32 USD instead of paying $286 USD. Spam promoting pornographic web sites is
also common in this category. My personal favorite offer is for a product that
will give a person "Free [Pay-Per-View]" -- an oxymoron if one doesn't
consider the fact that the product itself actually costs money. Another really
interesting sub-category in spam regarding software products is software
designed to address a person's paranoia -- such as software to scan a person's
PC for spyware, or software to spy on children and spouses using the family
computer, or software to learn about public records on others (or oneself!).
The irony is that installing such software will lead to the very things the
target spam recipients fear most.
Of the miscellaneous topics of other spam messages, alleged "blind dates" are
frequent, along with offers to earn various degrees (often just by paying a
small fee; no testing or qualifications necessary!). My personal favorite is
an offer with the subject: "Colin, Grow 2 Cup Sizes -- FREE!".
|
2. Notable Spam from the Years 2001-2003
The following images were taken from spam messages I received during the period
2001-2003.
|
FIGURE: I received this spam message on 2001 September 21st, 10 days after the
World Trade Center buildings were destroyed by fires following plane
crashes directly caused by terrorists.
This spam message, offering, among other things, a bumper sticker that
advocates a plan to "Nuke Afghanistan", demonstrates that spam can be
very political. Following the US initiation of the war on Iraq in 2003,
spam offering "Terrorist 'Most-Wanted'" playing cards, depicting 52
people targeted by the US anti-terrorism effort, arrived almost daily
in my e-mail "Inbox" for many months.
It is important to consider politically-motivated spam, or spam efforts
that seek to profit from hate.
FIGURE: This creepy spam message, like most spam messages, addresses some sort
of personal insecurity. This same product, in a funny coincidence,
was also promoted as a way to spy on naked women -- essentially
advocating a means to violate someone else's personal security.
FIGURE: This spam message, which I received some time in the year 2002, makes
an indirect reference to the film "The Matrix":
[Morpheus offers Neo a choice between two pills, one blue and one red.]
"You take the blue pill, the story ends, you wake up in your bed and
believe whatever you want to believe."
"You take the red pill...and I show you how deep the rabbit hole goes."
Although in the film it is the red pill (not the blue pill) that will
result in being shown "how deep the rabbit hole goes", the humor of
this Viagra spam message is hardly diminished.
Compared to penis-enlargement product spam messages of 2003 and 2004,
this spam message is fine art!
FIGURE: This spam message, which I received some time in the year 2002, is the
most outrageous invitation for irony I have ever seen.
The idea of promoting a sense of security by installing an application
that is completely invisible and secretly records all instant messages,
chat, e-mail, web sites, etc, is perverse. Naming the application
"IamBigBrother" is hilarious!
FIGURE: Apart from the fact that the product one might receive after responding
to this offer is likely to actually contain viruses rather than
prevent them, I chuckle at the hypocrisy of a spam message that
proclaims: "Norton Spam Alert filters unwanted e-mail!"
3. Non-technical spam examples from 2004
The following images were taken from spam messages I received during the year
2004, illustrating miscellaneous non-technical points.
|
FIGURE: This spam message is, apparently, an invitation to start a career
in sending spam messages! (Of course visiting the specified web site
address could lead to malicious software, such as spyware or a trojan mailer.)
I can't help it: I love the bravado of the author of this message!
My sober judgment says this person's position is very wrong, but
there is something very human about this message that resonates
with me.
FIGURE: This spam message promotes a service to allow one to send spam messages
to "27 million people". I suppose receiving this message itself lends
some (i.e., 1/27000000th) credibility to the sender's spamming
capability.
FIGURE: I like the domain name: "YetAnotherDomainName.com" (created 2004
January 29th, and resolving to 216.177.88.181 at the time of this
writing, 2004 April 3rd.)
This domain name actually captures the spirit in the spammer realm,
where purchasing new throw-away domain names to launch the next spam
campaign is a small price to pay to avoid spam obstacles, like
IP blacklisting or cancellation of service by the web hosting provider
(who discovers, too late, that a host was rented for use in a spam
effort).
FIGURE: I like this one; just sign up and start making money -- while doing
"Absolutely Nothing!". Scams are based on greed, and this example is
one of the purest appeals to greed I've ever seen.
FIGURE: This spam message promotes a book whose only possible positive use is
by law-enforcement officers trying to keep up with criminals.
I guess I might find such a book intellectually interesting, much like
the articles in "2600" magazine, and I support being able to freely
share any ideas whatsoever, but these facts do not make this book, and
others like it, any less repulsive to my ethical sensibilities.
How lonely and hostile a world it must be for a person to seriously
consider the subjects in such publications.
Some of the items seem somewhat contradictory: "Surf Internet
Anonymously" alongside a demonstration that one can "Hack into other
[people's] computers remotely"; or "How to get fake identity
documents" alongside "tracing and tracking people [...]
to find those who don't want to be found".
FIGURE: This spam message teases me with one of the central mysteries of spam:
How can anyone buy medication (which enters and affects one's very
body) from a business that thinks it is okay to use a fake sender name,
use a humiliating taunt as a subject, make numerous spelling errors
in the promotion, and end with a collection of random words? (If a
person is savvy enough to recognize the fact that the misspellings and
random words are methods of evading spam filters, all the more reason
to mistrust the business attempting to sell the medication!)
But, on the other hand, I am baffled by the state of mind of the
spammers who send messages like these. What world-view makes it okay
to send messages like these? (NOTE: I strongly agree with the
principle of free communication, but I still wonder why people choose
to express certain ideas.)
4. "Identity Theft" examples from 2004
The following images were taken from spam messages I received during the year
2004, illustrating "identity theft" efforts.
The basic concept is to convince the message recipient that it is necessary to
submit personal information, often to "prevent an account from expiring", or
for the recipient's "security" and "protection".
|
FIGURE: This spam message is admirable in its professional appearance and for
its outrageous inclusion of phone numbers and web site addresses to
help the victim gather information to be scammed more efficiently
and completely.
FIGURE: The PayPal scam e-mail message in the previous figure includes the
JavaScript code directly above -- which repeatedly writes the text
"http://www.paypal.com" to the browser status bar (lower-left border
of Internet Explorer, for example). Thus, when the user hovers the
mouse over the critical links in this spam message, the actual link
(which would be a potential give-away that this is a scam) is
quickly clobbered by the text "http://www.paypal.com". Only someone
watching the status bar intently while moving the mouse on and off
the various links (without clicking) would see the brief flash of
the real target web site address.
FIGURE: This "identity theft" scam, allegedly from Citibank, which I received
on 2004 April 4th, makes a direct request for a debit-card PIN
(Personal Identification Number), which can be used to withdraw cash.
This must be a psychology experiment, or a massive IQ test, or perhaps
the "Spam & Bologna" contest (in progress at the time of this writing).
FIGURE: The "Citibank" scam spam message in the previous figure has the HTML
code shown directly above.
FIGURE: This eBay scam isn't as elaborate as the PayPal scam above, but it
probably looks sufficiently professional to be effective.
5. Virus attachment examples from 2004
The following images were taken from spam messages I received during the year
2004, illustrating virus attachments. If I had a spare computer I'd be
tempted to download as many viruses as possible and have all the viruses
fight it out for control of the computer's resources. "Ready...FIGHT!"
|
FIGURE: I have to say: This e-mail with virus attachment is a bit of a
contemporary classic. I've never actually tried the virus out, you
know, to see if it fit my lifestyle, but, hey, "500.000" people
can't be wrong!
FIGURE: Wow, this virus attachment has quite a background story!
6. Trivial obfuscation example from 2004
The following images were taken from spam messages I received during the year
2004, illustrating use of trivial obfuscation.
|
FIGURE: This is a trivial form of obfuscating the content of HTML to thwart
text-based e-mail message content filters. The dummy HTML tags
break up the text that will ultimately appear in the HTML document.
The solution to this direct problem is to eliminate or consider the
effect of HTML tags before scanning for spam-indicating words, but
this is just the beginning of the hopelessness of content scanning,
as will become more evident in the subsequent sections of this article.
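The following is a minimal sketch in Python of the tag-elimination step described above. The one-word blacklist and the sample HTML are made up for illustration, and the regular expression is deliberately crude (a real filter would need a proper HTML parser). It shows a naive text scan missing a word that dummy tags have split, and the same scan succeeding once the tags are removed.

```python
import re

SPAM_WORDS = ("viagra",)  # hypothetical one-word blacklist


def naive_scan(text):
    """Return True if any blacklisted word appears in the text as given."""
    lowered = text.lower()
    return any(word in lowered for word in SPAM_WORDS)


def strip_tags(html):
    """Crudely remove anything that looks like an HTML tag or comment."""
    return re.sub(r"<[^>]*>", "", html)


# Made-up example of dummy tags breaking up a word.
html = "<p>Buy Vi<xq>ag</xq>ra<!-- zzz --> today</p>"

print(naive_scan(html))              # False: the word is split by tags
print(naive_scan(strip_tags(html)))  # True: the tags no longer hide it
```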
7. Unicode examples from 2004
The following images were taken from spam messages I received during the year
2004, illustrating Unicode use.
|
FIGURE: "Unicode" characters allow the characters of major world languages to
be placed in documents, such as documents in the HTML format. The
spam message shown above, which I received in 2004 March, illustrates
a conventional use of Unicode characters -- in this case to represent
letters of the Russian (Cyrillic) alphabet.
FIGURE: Spammers have found another use for Unicode characters: displaying
characters that look like English letters, but in fact are letters and
symbols from other world languages. Thus, human readers of English
have no trouble reading the text, but automated content scanners
will fail to detect the presence of "spam-indicating" words.
Of course one solution is to build up a table of how Unicode
characters visually relate to English letter and number characters.
But, given the large number of Unicode characters visually
"compatible" with various English letter and number characters,
this effort is likely to be intractable. Combine this with
strategic misspellings and random interjection of punctuation,
and content filters are doomed to fail.
I suppose an isolationist American could block all e-mail containing
Unicode, but even plain English characters can be used in creative
ways that humans have no trouble reading but present an intractable
problem for content scanners. A filter that rejected ungrammatical
content, or content with many misspellings, would likely claim a
large fraction of "legitimate" domestic e-mail!
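As a rough illustration of the look-alike table idea, here is a minimal sketch in Python. The table below is hand-made and covers only six characters, which is exactly the weakness discussed above: a usable table would have to cover a very large number of code points. The disguised word is a made-up example.

```python
# A tiny, hand-made table mapping a few look-alike characters to ASCII.
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u0399": "I",  # GREEK CAPITAL LETTER IOTA
}
TABLE = str.maketrans(LOOKALIKES)


def fold_lookalikes(text):
    """Replace known look-alike characters with their ASCII counterparts."""
    return text.translate(TABLE)


# Displays like "Viagra", but three of the six letters are Cyrillic.
disguised = "V\u0456\u0430gr\u0430"

print("viagra" in disguised.lower())                   # False
print("viagra" in fold_lookalikes(disguised).lower())  # True
```

Any look-alike character missing from the table passes through untouched, so the scan silently fails again.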
8. Off-topic text examples from 2004
The following images were taken from spam messages I received during the year
2004, illustrating use of off-topic text.
|
FIGURE: This spam message includes a paragraph from a formal text (in this
case a Travel Warning issued from the United States Department of State
on 2004 March 23rd: http://travel.state.gov/israel_warning.html; I
discovered this by a Google search for "curfew should remain indoors").
The meaning of the text isn't as important as the fact that it is
grammatical, potentially "interesting", and large enough to greatly
"outweigh" any spam indicators that might be detected elsewhere in the
message.
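To show how off-topic text can "outweigh" spam indicators, here is a minimal sketch in Python of a naive-Bayes style score. All of the per-word probabilities are hypothetical values chosen for illustration, not measurements from any real filter, and the combining rule is a simplified form of what statistical filters do.

```python
from math import prod

# Hypothetical probabilities that a message containing the word is spam.
WORD_SPAM_PROB = {
    "cheap": 0.99, "pills": 0.99,            # assumed spam indicators
    "curfew": 0.2, "citizens": 0.2,          # assumed "formal prose" words
    "should": 0.2, "remain": 0.2, "indoors": 0.2,
}
UNKNOWN_PROB = 0.5  # a word the filter has never seen carries no evidence


def spam_score(words):
    """Combine per-word probabilities into one score between 0 and 1."""
    probs = [WORD_SPAM_PROB.get(w, UNKNOWN_PROB) for w in words]
    p_spam = prod(probs)
    p_ham = prod(1.0 - p for p in probs)
    return p_spam / (p_spam + p_ham)


spam_only = ["cheap", "pills"]
padding = ["curfew", "citizens", "should", "remain", "indoors"] * 2
padded = spam_only + padding

print(round(spam_score(spam_only), 4))
print(round(spam_score(padded), 4))
```

Under these assumed numbers the two spam words alone score close to 1, while the same words followed by ten "formal" words score close to 0.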
9. Base-64 encoding of HTML examples from 2004
The following images were taken from spam messages I received during the year
2004, illustrating use of base-64 encoding of HTML.
"Base-64 encoding" is a method of representing sequences of Byte values by a
sequence of ASCII characters within the following set of 64 characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
Thus, the character 'A' corresponds to an integer value "0" (zero), and the
character '/' corresponds to an integer value "63" ("111111" in binary).
Groups of three Bytes from an input sequence are regarded as a sequence of
24 bits. Four 6-bit values are extracted and converted to the corresponding
characters in the set above. If the input sequence has a total number of
Bytes that is not a multiple of three, the final group is padded with
zero-value Bytes, and each output character that derives entirely from
padding is written as '='.
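The encoding steps described above can be sketched in a few lines of Python. This is an illustration written for clarity, not speed; the function name is made up, and Python's standard "base64" module is used only to cross-check the result.

```python
import base64

ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")


def encode_base64(data):
    """Encode bytes three at a time into four characters each."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)
        # Pad a short final group with zero-value bytes to get 24 bits.
        value = int.from_bytes(chunk + b"\x00" * missing, "big")
        # Extract four 6-bit values, most significant first.
        chars = [ALPHABET[(value >> shift) & 0x3F]
                 for shift in (18, 12, 6, 0)]
        # Characters that derive only from padding are written as '='.
        if missing:
            chars[-missing:] = ["="] * missing
        out.append("".join(chars))
    return "".join(out)


for sample in (b"Man", b"Ma", b"M"):
    print(sample, encode_base64(sample))
```

The three samples show the unpadded case and the two padded cases, with one and two '=' characters respectively.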
Base-64 encoding is typically used to allow binary data, such as e-mail
file attachments (with extensions ZIP, JPG, MP3, DOC, EXE, etc), to be
contained in the plain-text body of an ordinary e-mail. Thus, text-based
operations can be conducted on mail archives without worrying about
encountering non-ASCII characters, or problematic "control characters"
(such as "Null"/NUL/0x00/^@, "End of Transmission"/EOT/0x04/^D, etc).
However, spammers have used base-64 encoding as a simple means to obfuscate
their HTML content. Thus, very simple text filters, or human readers, cannot
easily examine the content of such spam messages. Sure, it is easy to add
a base-64 decoding stage to a content-based spam filter, but this is yet
another example of the increasing complexity of spam detection.
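A decoding stage of the kind mentioned above might look like the following minimal sketch in Python, which relies on the standard "email" package to undo the declared transfer encoding. The sample message, its addresses, and the function name are made up for illustration.

```python
import base64
from email import message_from_string


def extract_text(raw_message):
    """Return the decoded text of every non-multipart part of a message."""
    msg = message_from_string(raw_message)
    texts = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        # decode=True undoes base-64 when the part declares it.
        payload = part.get_payload(decode=True)
        if payload is not None:
            charset = part.get_content_charset() or "latin-1"
            texts.append(payload.decode(charset, errors="replace"))
    return "\n".join(texts)


# Made-up sample: an HTML body hidden behind base-64.
html = "<html><body>Cheap pills here</body></html>"
encoded = base64.b64encode(html.encode("ascii")).decode("ascii")
raw = ("From: someone@example.com\n"
       "Subject: hello\n"
       "MIME-Version: 1.0\n"
       "Content-Type: text/html; charset=us-ascii\n"
       "Content-Transfer-Encoding: base64\n"
       "\n" + encoded + "\n")

print("pills" in raw)                # False: hidden by the encoding
print("pills" in extract_text(raw))  # True: visible after decoding
```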
|
FIGURE: This is the plain-text appearance of a spam message containing HTML
encoded as base-64.
The following C code compiles to a very crude base64-to-text conversion
utility. One must manually place a base-64 block of text, by itself, in a
text file, and then use this utility to generate text output -- which, by the
command line, can be directed to an output file. I only wrote this code as a
learning experience, and to convert samples of base-64 from e-mail messages.
I offer it for illustration, and perhaps inspire you to seek a "real" base-64
decoding utility elsewhere.
FIGURE: This is the plain-text decoded form of the base-64 content in the
spam e-mail above.
Apart from revealing the web site address promoted by the spam
message, the decoding also reveals "random" words designed to
"dazzle" Bayesian spam filters (such as the Mozilla browser
Bayesian spam filter). It is important to note that this
spam-filter countermeasure was placed within a base-64 encoded
block -- which means the spammer assumes that base-64 blocks
will be decoded and subjected to content analysis.
One more VERY IMPORTANT thing to note about this example:
The spam message is totally contained in the image specified by
the HTML image tag, stored on a remote server. Thus, unless a
content-based filter notes the "tabs", "biz", and "pills" parts
of the web site address and file paths, this message is totally
benign. (A better spammer would not put "tabs" and "pills" in
any part of a URL. An explicit IP could eliminate the explicit
"biz" domain part -- although a reverse DNS lookup could find the
"biz" part.) The point is that it is trivial to eliminate any
evidence that the message is spam. Sure, one could add a new
spam detection rule that flags e-mail messages that only contain
HTML image tags, etc, but the risk of flagging legitimate e-mails
in the process is high.
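The "image tags only" rule, and its risk, can be sketched as follows in Python using the standard HTML parser. The rule, the class and function names, and the sample messages are all hypothetical; the addresses use a reserved documentation range.

```python
from html.parser import HTMLParser


class _Summary(HTMLParser):
    """Count image tags and visible (non-whitespace) text characters."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1

    def handle_data(self, data):
        self.text_chars += len(data.strip())


def is_image_only(html):
    """Flag HTML that contains at least one image and no text at all."""
    summary = _Summary()
    summary.feed(html)
    summary.close()
    return summary.images > 0 and summary.text_chars == 0


spam = ('<html><body><a href="http://192.0.2.1/">'
        '<img src="http://192.0.2.1/a.gif"></a></body></html>')
family_photo = '<html><body><img src="cid:holiday-photo"></body></html>'
with_text = '<p>Hi Mom</p><img src="cid:holiday-photo">'

print(is_image_only(spam))          # True
print(is_image_only(family_photo))  # True: a legitimate message is flagged
print(is_image_only(with_text))     # False
```

The second result is the false positive described above: a legitimate message consisting of a single photograph is indistinguishable from the spam under this rule.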
10. Example of tracing spam from 2004
This section describes a modest effort to learn more about a spammer.
|
FIGURE: Despite the "explanation" of why this business is able to offer such
low prices on software, this spam message was clearly sent by a
software pirate. If indeed one can actually pay money and initiate
a file download from this spammer's server, there is no assurance
that the file will be the actual application, as opposed to a virus,
or a virus-infested version of the desired application.
In any case, for the purpose of illustrating the search for information
about sources of spam, I chose to look up the owner of the domain
(associated with a URL found in the HTML code of this spam message):
"buye-soft.biz".
The search begins by going to the "whois" search page of the InterNIC web site:
FIGURE: I assume ALL of the information returned by this "whois" search is
fake -- except for the trivial details: Domain Name, Domain ID,
Domain Status, Registrant ID, Name Server, Created By Registrar,
Last Updated by Registrar, and the dates.
The only really interesting part is the "Domain Registration Date",
which in this case indicates that the domain was created less than
one week prior to my receipt of a spam message containing links to
a web site in this domain.
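The registration-date observation can be turned into a crude check, sketched below in Python. The dates are hypothetical, the seven-day threshold is an assumption taken from the "less than one week" remark, and the function names are made up. Obtaining the registration date (for example from a "whois" record) is left out of the sketch.

```python
from datetime import date


def domain_age_days(registered, received):
    """Days between domain registration and receipt of the message."""
    return (received - registered).days


def is_young_domain(registered, received, threshold_days=7):
    """True if the domain was registered shortly before the message."""
    return 0 <= domain_age_days(registered, received) < threshold_days


# Hypothetical dates, for illustration only.
registered = date(2004, 3, 27)
received = date(2004, 4, 1)

print(domain_age_days(registered, received))  # 5
print(is_young_domain(registered, received))  # True
```

A check like this would also flag any legitimate business that happens to send mail during its first week online.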
I have seen this many times -- domains registered within days of
corresponding spam messages (containing links to hosts within those
domains). As mentioned earlier, the $20 one-year domain registration
fee and a $20 web hosting fee make for a $40 total overhead for
starting a spam campaign -- and a single customer, or a single scam
victim, can easily surpass the break-even point of the business model!
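The break-even arithmetic can be worked through in a short Python sketch. The $40 overhead and the 1-in-10000 response ratio come from this article; the profit per sale is a hypothetical figure chosen only to make the calculation concrete, and the cost of actually sending the messages is assumed to be negligible.

```python
import math

OVERHEAD_USD = 20 + 20            # domain registration + web hosting
RECIPIENTS_PER_CUSTOMER = 10000   # response ratio of 1/10000
PROFIT_PER_SALE_USD = 50          # hypothetical, not from the article


def customers_to_break_even(overhead, profit_per_sale):
    """Smallest whole number of customers that covers the overhead."""
    return math.ceil(overhead / profit_per_sale)


def recipients_needed(customers, recipients_per_customer):
    """Messages that must be sent to expect that many customers."""
    return customers * recipients_per_customer


customers = customers_to_break_even(OVERHEAD_USD, PROFIT_PER_SALE_USD)
recipients = recipients_needed(customers, RECIPIENTS_PER_CUSTOMER)

print(customers)   # 1
print(recipients)  # 10000
```

Under these assumptions a single sale covers the overhead, so a mailing of ten thousand messages is already expected to break even.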
DEFINITION OF "SPAM"
Spam: An unsolicited message sent to a large number of recipients.
|
Unfortunately, that simple description relies on subjective terms,
which are considered in this section.
A. "Unsolicited"
Whether or not an e-mail message was "solicited" (i.e., the recipient
explicitly or implicitly requested the message in advance) is one of
the most-debated issues related to spam.
"Registration" for Online Services
Many online services require "registration" -- a process whereby an
entity (e.g., corporation or organization) gathers personal
information from a person who desires to use their service. The
registration process may demand an e-mail address. The registration
process may enforce this demand by sending an e-mail to the
prospective user of the service with a piece of information essential
to the completion of the registration process (such as a password,
or a temporary URL to visit) thus validating the (potentially
temporary) association between the e-mail address and the user.
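The confirmation step described above can be sketched in Python as follows. This is a minimal illustration, not a description of any particular service: the class and method names are made up, the token is returned directly rather than e-mailed, and storage is an in-memory dictionary.

```python
import secrets


class Registrations:
    """Track e-mail addresses awaiting confirmation."""

    def __init__(self):
        self._pending = {}      # token -> e-mail address
        self.confirmed = set()  # addresses whose owners responded

    def start(self, address):
        """Create a one-time token; a real service would e-mail it."""
        token = secrets.token_urlsafe(16)
        self._pending[token] = address
        return token

    def confirm(self, token):
        """Accept a token once; unknown or reused tokens are rejected."""
        address = self._pending.pop(token, None)
        if address is None:
            return False
        self.confirmed.add(address)
        return True


service = Registrations()
token = service.start("user@example.com")  # hypothetical address
print(service.confirm("not-the-token"))    # False
print(service.confirm(token))              # True
```

Only someone who can read mail sent to the address learns the token, which is what ties the address to the user at that moment.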
If the service provider is an ethical corporation (i.e., one whose
controlling power is not beholden to the interests of profit
alone, but instead has ethical conduct as part of its mission),
then very early on in the registration process there will be a full
description of how all pieces of information will be used (and,
perhaps more importantly, not used).
Unfortunately, e-mail "notifications" and offers from "partners and
affiliates" are often part of online service contracts.
The Abuse of the "Opt-in" Concept
Although the practice is fading as spammers simply give up on maintaining
the illusion of compliance with laws, for a time (around 2000-2002) it
was common to see spam messages ending with a disclaimer that essentially
stated the following:
"This message is not spam. You are receiving this message
because you requested this message from this service, or
opted-in to mailings from one of our affiliates."
One can either ponder in amazement that Citibank, Bank of America, or
Microsoft would be the "affiliates" of pornographers, illiterate Viagra
purveyors, wireless spy camera vendors, and software pirates -- or conclude
that the disclaimer is a total falsehood.
It is not difficult to imagine that a giant corporation might have a
slightly less monitored affiliate with slightly lower standards of
business ethics. And it is easy to imagine that, through the corporate
equivalent of "Six Degrees of Separation", personal data submitted to,
say, any giant corporation might eventually, through a chain of
"affiliation", make it down to any given business, or outright
criminals, on our planet.
|
ASIDE: Spam Analogy in Other Forms of Indiscriminate Communication
It is interesting to note that billboards on the sides of freeways, buses,
and taxi cabs, might qualify as a kind of government-approved visual spam.
I bet that if a proposition to eliminate all billboards were placed on a state
ballot, the overwhelming majority would vote in favor of the proposition.
The fact that billboards clutter our visual space testifies to the gap
between local government and its constituents.
I think it is interesting to consider spam in relation to billboards that
essentially radiate data by photons in all directions without regard for
recipients, or loudspeakers that radiate data by sound waves in
many directions without regard for recipients, or mass-distribution of a
single message via postal mail, facsimile machines, telephone, or fliers.
Some of these "spam" variations rely on proximity, so the "technology" to
avoid such spam is simply to move away from the emitter. But other
variations essentially bring the message very close to targets, such as
postal mail or a telemarketing phone call. Here, the "technology" to avoid
being distracted with the task of differentiating between spam and desired
messages is limited to "refusing bulk postal mail by request" and becoming
part of the "national 'do-not-call' list" (which relies on vigilant
consumers, and laws to act as a deterrent for would-be violators).
|
|
B. "Message"
Obviously the term "message" must be used in the broad sense, referring
to the "intention" of the data received by each recipient. Otherwise,
a second message constructed by, for example, permuting the sentences of
the first would be considered a distinct "message".
Unique Text per Recipient
Almost all spam messages today contain procedurally generated
text that is unique per recipient.
Some of this procedural text is totally-random words (to defeat
word-frequency filtering, or word-pair Bayesian Filtering), or
random-but-grammatical (to defeat more advanced grammar-based
filtering), or random text from online news and information sources
(which is likely to be overwhelmingly approved by any filtering).
Also, the real message text, the intended communication, can be
procedurally modified to be unique per intended recipient.
This can occur at the character level, word level, sentence level,
and paragraph level. Misspellings can be introduced, especially
using look-alike characters ('0' vs. 'O'; or Unicode characters)
or transposing adjacent letters within a word ("Viagra" vs.
"Vaigra"). Sentences can be permuted.
So, defining "message" as a literal sequence of characters (or
bytes) will fail to identify "spam".
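A small Python sketch makes the point concrete. It treats a "message" as its literal character sequence by fingerprinting the text with a hash, then checks three of the per-recipient variations mentioned above. The sample texts and the fingerprint helper are invented for illustration.

```python
import hashlib

def fingerprint(text):
    """Hash of the exact character sequence, as a literal definition would use."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

original = "Buy Viagra now. Lowest prices anywhere."
variants = [
    "Lowest prices anywhere. Buy Viagra now.",   # sentences permuted
    "Buy Vaigra now. Lowest prices anywhere.",   # adjacent letters transposed
    "Buy V1agra now. L0west prices anywhere.",   # look-alike characters
]

known_spam = {fingerprint(original)}
for v in variants:
    # Prints False for every variant: none matches the stored fingerprint.
    print(fingerprint(v) in known_spam, v)
```

Every variant conveys the same intention to a human reader, yet none is recognized as the "same" message.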
|
|
METHODS WHICH FAIL TO SIGNIFICANTLY REDUCE IMPACT OF SPAM
1. LAWS
The assumptions leading to the creation of laws to indirectly reduce the impact
of spam include:
(a) The existence of a tough law against spamming will serve as a sufficient
deterrent for potential spammers;
(b) The person ultimately responsible for the spam messages can be
identified;
|
Reasons why laws cannot stop spam include:
(1) Messages can originate in countries that either do not have spam laws or
do not have sufficient resources to enforce spam laws;
(2) Spam may originate from any of the billions of people on our planet with
Internet access, and although laws may lead to a sufficient psychological
deterrent for the vast majority, it only requires a few people to
generate billions of e-mail messages per day.
(3) The connection between businesses and spamming campaigns will be
increasingly difficult to make, especially if there are a few cases in
which competitors or hackers seek to implicate a company as a spammer
by secretly initiating a spamming campaign on that company's "behalf".
Such a scenario, among others, would introduce a level of doubt about
the simplistic argument that the party that benefits from spam must
be responsible for the spam. There is, rightly, a large amount of
plausible deniability in this context.
One particularly difficult variety of spam campaign to prevent with
legal deterrents is one launched spontaneously by a person on his or her
own initiative on behalf of a political cause or some other large-scale
phenomenon. A person with a modest amount of programming ability can
compromise any mail system and use it to spread a message that was not
endorsed by the organization that might ultimately benefit. The
individual spammer promotes a cause, which, if advanced, will somehow
end up benefiting the spammer. It's a kind of "advocacy laundering",
relying on the large number of advocates and potential beneficiaries.
Examples include: promoting specific stocks; promoting political
candidates or legislation; promoting hate; etc.
(4) Although this is more of an observation than an explanation, it is
interesting to consider that many spam messages are far more illegal
than simply wasting people's time with unwanted advertisements. Spam
messages can be used to deploy viruses (destructive, spying, use of
computing resources) or play a role in the process of "identity theft"
(such as directing a person to a fake clone of a banking or service web
site and requesting confidential information, often, ironically, with
the claim of "increasing security" through "verification"). The people
behind such spam efforts are way beyond considering the consequences of
spam-specific laws.
|
2. IP/E-MAIL ADDRESS BLACKLISTS
The following describes the concept of an IP/e-mail address blacklist:
Each time a complaint about spam originating from a specific IP address (or an
entire subnet or domain) or from a specific e-mail sender address is reported,
the source is added to a list (blacklist) that is used as a basis for deciding
whether or not to allow future e-mail messages to pass.
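The mechanism can be sketched in a few lines of Python. This is only a minimal model, assuming a Blacklist class with invented method names; real blacklist services differ in how reports are collected and distributed.

```python
import ipaddress

class Blacklist:
    def __init__(self):
        self.networks = []     # reported IP addresses or whole subnets
        self.senders = set()   # reported sender e-mail addresses

    def report_network(self, cidr):
        self.networks.append(ipaddress.ip_network(cidr))

    def report_sender(self, address):
        self.senders.add(address.lower())

    def allows(self, source_ip, sender):
        """Decide whether a future message from this source may pass."""
        ip = ipaddress.ip_address(source_ip)
        if any(ip in net for net in self.networks):
            return False
        return sender.lower() not in self.senders

bl = Blacklist()
bl.report_network("192.0.2.0/24")          # one complaint blocks 256 addresses
bl.report_sender("offers@spam.example")
print(bl.allows("192.0.2.55", "anyone@example.org"))     # False
print(bl.allows("198.51.100.7", "friend@example.org"))   # True
```

The first lookup is refused even though that particular sender was never reported, simply because it shares a subnet with a reported source.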
|
Reasons why a blacklist method cannot stop spam and would lead to many problems
and ethical issues include:
(1) Spammers can easily change the origin of spam. Domain registration costs
as little as $20, and renting the use of a web server in a data center can
be very inexpensive and without significant commitment. Do a "whois"
lookup on the domain name associated with a spam message (if any such
links exist in the body), and you may discover that the domain was
established just a few days prior to your receipt of the spam message
-- a delay just long enough for the domain registration to propagate
to DNS servers world-wide. Even if the blacklisting occurs on the same
day as the spam campaign, the spam has already reached all intended
recipients.
(2) Blacklisting introduces the risk of a result that is incomprehensibly
bad: The diminishing of free communication.
China blocks many web sites (including, apparently, my own site).
Some corporate web sites refuse traffic referred by other sites. Search
engines, which many people rely on as an "unbiased representation of
web content", impose their own content filtering and prioritization
(sometimes to avoid lawsuits, or to serve the paid interests of business
partners or investors).
Blacklisting is opposed to democracy, where people depend on being able
to gather or receive information without any bias in the mechanism of
gathering or receiving. When major "news" sources fail to deliver
information without bias, the Internet may be our only recourse in the
quest for the facts. Any mechanism that interferes with such a quest
is opposed to democracy. Even "gate-keepers" with good intentions can
lead to a condition in which freedom of expression and reception of
ideas is greatly diminished without a corresponding degree of benefit
from the act of filtering information.
The best type of decision-making involves having as much reliable
information as possible, and many important decisions depend on learning
about conditions that are frequently changing. Blacklisting
interferes with such decision-making.
(3) Malicious people can provoke the blacklisting mechanism into blocking
domains, subnets, or e-mail addresses. By studying the blacklisting
mechanism, one can devise a method to use it to one's advantage.
Hackers can cause the blacklisting of anything capable of being
blacklisted, perhaps simply by generating "spam" on "behalf" of each
intended blacklisting target.
(4) Automated blacklisting cannot differentiate between non-critical links
and critical links in spam HTML code. Thus, a spammer can populate
e-mail messages with links to reputable web sites not actually
affiliated with the spam campaign (such as putting real links to
PayPal, eBay, Earthlink, Citibank, etc, to make a convincing clone
of an actual corporate announcement). Automated blacklisting may
blacklist legitimate web sites or domains, and miss the highly
obfuscated links to the site associated with the spam effort.
(5) Blacklisting will fail to achieve any beneficial result and will cause
many random people communication problems in the case of spam
generated by distributed mail servers (as with Trojan viruses that
install e-mail servers on all infected systems). Not only can the
e-mail come from any of thousands of infected machines, but it can even
offer working links (URLs) to tiny web servers in any of the thousands
of infected systems. The hacker is free to link into the
peer-to-peer virus system and gradually and passively accumulate,
for example, "identity theft" information. Blacklisting would likely
block random individuals, and possibly many random domains.
The virus need not be a stand-alone, clandestine application, but can
in fact be a hacked form of a standard peer-to-peer application.
Given the capabilities and lack of attention to high-security for
popular peer-to-peer applications (let alone the applications people
share and execute), conveying a patch or script to transform a
file-sharing application into a platform for spam delivery and for
web page serving (i.e., totally decentralized "web site" with
thousands of radically different URLs with the same content) is
trivial.
|
3. GENERIC HUMAN-SKILL CHALLENGES
The following describes the concept of a generic human-skill challenge:
If a resource, such as a communication channel, is only intended for use by
humans, and not automated systems, then a challenge can be designed to
determine if a request for access to the resource comes from a human.
A generic human-skill challenge is a task that the overwhelming majority of
adult humans can easily accomplish without any prior knowledge about the
challenge, while, at the same time, any practical automated system would
consume resources (time, space, cash, etc) far beyond the value of the
resource protected by the challenge.
|
The free "Yahoo! Mail" service uses a generic human-skill challenge during
the registration process to prevent automated systems (created by spammers,
for example) from registering for thousands of e-mail accounts.
Aspects of "Yahoo! Mail", including the generic human-skill challenge, are
shown in the following figure:
One VERY IMPORTANT thing to consider about "Yahoo! Mail", as described by
the quotes shown above regarding service features, is that e-mail messages
AND attachments are AUTOMATICALLY ANALYZED. While this analysis may have
been devised by well-meaning individuals trying to improve the experience
of users of the "Yahoo! Mail" service, it is critical to understand that
this is detailed analysis of one's communication. Although I believe the
privacy of e-mail itself is promised up to a specific point (like court
subpoenas), I do not see any such promise of privacy concerning the use of
the results of the analyses (which may actually contain rough semantics
(meanings) of content). Is a transliteration of an e-mail, using different
figures of speech and phrases, and synonyms where possible, assured the
same level of privacy? In what sense is even an "anonymous, automated"
analysis of e-mail messages and attachments "private"?
Generic human skill challenges can be compromised by automating a system of
getting actual humans to unwittingly solve challenges on behalf of the
automation. Yet another compromise is to devise algorithms that can actually
perform the task that supposedly requires "exclusively-human" skills.
|
4. COMMUNICATION-RATE REGULATION BY REQUIRING RESULTS OF DIFFICULT COMPUTATIONAL TASKS
The following describes the concept of communication rate regulation by
requiring results of difficult computational tasks:
Communication rates can theoretically be regulated by requiring a message
sender to solve a random mathematical problem, of known difficulty, posed
by the receiving party. If the solution offered by the sender is not
correct, according to the receiver, then the message is rejected.
The receiver can multiply two probable-prime numbers together, having a
product with a specific number of digits, and request that a message
sender do the work of factoring the product back into the prime factors.
The receiver can easily check whether or not the factors submitted by
the message sender yield the correct number when multiplied together.
Although a factoring task will take a different amount of time for
different processor types and processor speeds, the order of magnitude
of the time required might be the same for all contemporary mail servers.
If the message receiver thinks the problem was solved too quickly, it
can pose several more problems.
The computational work keeps the sender processor very busy, which, in
principle, prevents the sender processor from the task of sending lots
of messages to the same or multiple communication relays.
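The factoring exchange can be sketched in Python as follows. This is a toy under stated assumptions: the primes are tiny (five digits) and found by trial division rather than a probable-prime test, and the function names are invented. It shows only the asymmetry between solving and checking.

```python
import random

def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def random_prime(lo, hi):
    while True:
        candidate = random.randrange(lo, hi)
        if is_prime(candidate):
            return candidate

def pose_challenge():
    """Receiver: multiply two primes and hand only the product to the sender."""
    return random_prime(10_000, 100_000) * random_prime(10_000, 100_000)

def solve_challenge(product):
    """Sender: recover the factors by trial division (the deliberately slow part)."""
    p = 2
    while product % p != 0:
        p += 1
    return p, product // p

def verify(product, factors):
    """Receiver: a single multiplication checks the sender's work."""
    p, q = factors
    return p > 1 and q > 1 and p * q == product

challenge = pose_challenge()
print(verify(challenge, solve_challenge(challenge)))   # True
```

The difficulty is tuned by the number of digits in the product; the receiver's cost stays at one multiplication regardless.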
|
Reasons why communication-rate regulation by requiring results of difficult
computational tasks can fail to control spam include:
(1) Much as with distributed computing projects such as "SETI@Home" or
"Folding@Home", spammers can use trojan viruses or hacked versions of
popular Peer-to-Peer (P2P) applications (BitTorrent, Morpheus, KaZaA,
DirectConnect, Gnutella, Napster, eDonkey) to "farm out" ("out-source")
the computational tasks requested by the pesky communication relay.
In fact, the P2P applications can do the work of mail transmission on
the spammer's behalf, spreading the computational work.
(2) Unless the communication-rate regulating tasks are required between
mail relay servers, in addition to being required between sender
client and a mail server, then all it takes is for some mail relay
server to be compromised, either by hackers or just by unscrupulous
server managers, for spam to enter the system with full force.
(3) One wonders if "affiliates" of giant corporations will face the same
computational tasks when hoping to fill your mailbox with spam!
After all, one might be forced to allow all "partners and affiliates"
of a mail service provider to notify one of special offers and
products. And there is no end to the possible "affiliate" chain...
|
5. CONTENT ANALYSIS
The following describes the concept of content analysis:
Content analysis involves automated scanning of e-mail messages to
differentiate between messages serving the agenda of spammers and messages
sent from: family, friends, colleagues, coworkers, subscription mailing lists,
and strangers who initiate contact. This differentiation must be based on
the content of the message itself. Considering other factors, such as sender
address, or blacklisted IPs embedded within the message, is not properly
part of plain "content analysis".
|
Reasons why content analysis is undesirable:
It is difficult to foresee all of the consequences of loss of privacy in
one's life. Loss of privacy regarding medical conditions (AIDS, STDs,
cancer, drug addiction, alcoholism, etc) can result in discrimination
of various sorts. Loss of privacy regarding ideas and beliefs can
result in embarrassment or dismissal from jobs. Loss of privacy regarding
inventions, creative ideas, business plans, agendas, competitive
strategies, etc, obviously can lead to profound financial and intellectual
losses.
It is interesting to imagine a world totally without privacy. I have
considered how I would adapt. Many institutions and social conventions
that relied on privacy would be gone. I'm inclined to think that this
new world would be better, as hypocrisy and politics would be shattered,
and people would operate from truth rather than image. No illusion
could be sustained. No crime would go unsolved. No hypocritical
positions on drugs, pornography, religion, etc, would go unnoticed.
So, I'm a little torn on this issue. If society could flip a switch
and enter "total information mode", then maybe I'd feel more comfortable.
But when there is an imbalance, with large corporations or government
agencies enjoying the benefits of "total information mode", while
consumers and citizens do not have such a benefit, then I am concerned.
Google's new "G-Mail" service (run by "G-Men"?!) is a paranoiac's
nightmare come to life: an e-mail system that analyzes each piece of
mail not simply for spam detection and virus detection, but for
targeted advertising! Want to embarrass a friend using G-Mail?
Send an e-mail mentioning only "farm sex" or "kiddie porn", and
who knows what "targeted advertising" he or she will get. Perhaps
there will be a knock on the door from the FBI.
Despite promises that content analysis will be isolated from any
record-keeping, and totally performed anonymously by computer
algorithms, the fact is that one's mail is being analyzed, and
even requests for banner ads represent an information leak --
connecting your interest with your IP (since your browser fetches
the banner ad from whatever source). With a modest amount of
extra data mining, an "anonymously targeted" banner ad betrays
your identity.
|
Reasons why content analysis can fail to control spam include:
(1) Ultimately, only a message recipient can decide, based on content alone,
whether or not a message is desired. Clearly a "one-filter-fits-all"
idea won't work, so the alternative is "training" a filter, or directly
specifying message content that is not desired. Even then, rules and
exceptions will not handle the next wave of spammer messages.
Every content-based filtering method thus far has been thwarted. All a
spammer has to do is study a spam filter to devise messages that will
pass through. Perhaps this will eventually require the spammer to use
human-like AI to generate human-like mail that essentially conveys
the spammer's "message" but without triggering any content-based filters.
Randomly injected HTML tags thwarted simple scanning and dictionary
lookup. Salting the message with random words thwarted Bayesian
filtering. Including boring texts from random, but serious, sources,
made messages overwhelmingly grammatical and statistically close to
potentially "interesting" message content, thwarting advanced
content-based filtering. Base-64 encoding added another layer to the
content scanning work. Unicode, visual character look-alikes, randomly
injected (but non-obtrusive) punctuation within words, strategic
misspellings and letter transpositions that are transparent to human
readers, etc, all result in the actual MEANING of the content being
totally UNDETECTABLE! Finally, the use of images served from a web
site, in conjunction with invisible, but totally benign, content,
is totally immune to the desired kind of content analysis.
(2) As mentioned with other spam filtering methods in previous sections,
the problem of mail services not performing filtering on the promotions
and offers of their "partners and affiliates" may just change the
nature of the spam from world-wide to high-paying corporate sponsors.
This puts the burden of content-based filtering on the client.
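To illustrate reason (1), here is a deliberately naive word-list filter in Python, together with a few of the evasions listed above. The blocked-word list and function name are invented for this sketch; real filters are more elaborate, but each added layer invites a corresponding trick.

```python
import base64

BLOCKED_WORDS = {"viagra", "lottery"}   # hypothetical word list

def naive_filter_flags(body):
    """Flag a message if any blocked word appears as a whole word."""
    words = body.lower().split()
    return any(w.strip(".,!?") in BLOCKED_WORDS for w in words)

plain = "Cheap Viagra today!"
tricks = {
    "transposition": "Cheap Vaigra today!",
    "look-alike": "Cheap V1agra today!",
    "punctuation": "Cheap V.i.a.g.r.a today!",
    "base64": base64.b64encode(plain.encode()).decode(),
}

print(naive_filter_flags(plain))             # True: the plain text is caught
for name, body in tricks.items():
    print(name, naive_filter_flags(body))    # False for every trick
```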
|
OTHER UNDESIRABLE SPAM CONTROL METHODS
(1) Internet-wide packet tracking (does not defeat Trojan P2P type outgoing
mail servers; invades privacy)
(2) "Postage" per e-mail (does not defeat Outlook Trojans and other P2P
mail servers)
(3) Relying on corporations and building up "user reputation" (like eBay
system, or Slashdot karma); invades privacy, corporate interests may
create artificial barriers to moving data (e.g., BREW), and there is
no guarantee that corporations won't eventually "sell-out" the promise
of no spam -- or it will be argued that partner "notifications" won't
be spam (according to the end-user terms).
SPAM CONTROL PROPOSAL
This section contains a proposal for SOFTWARE and SOCIAL PRACTICES that have
the potential of greatly reducing the nuisance of spam from a person's life.
GENERAL INFORMATION
Things required by this proposal:
(1) A person who wishes to greatly reduce spam must install software on each
computer with an e-mail client application (such as Microsoft Outlook).
(2) A person who wishes to greatly reduce spam, when sharing his or her
e-mail address, must also go through the trouble of sharing a code
number.
(3) Mailing list services must make a slight modification to their databases
and mailing scripts to store and use codes in addition to e-mail addresses.
|
Things that are NOT required by this proposal:
(1) Changes to e-mail servers, e-mail protocols, e-mail content
standards, or Internet infrastructure, are not required.
(2) Existing spam countermeasures (content-filtering, IP blacklisting,
anti-spam laws, etc) will not be necessary. (Such countermeasures
are futile and dangerous anyhow.)
(3) It is possible that changes to existing e-mail clients will not be
required.
|
Things that will NOT be directly helped by this proposal:
(1) Internet bandwidth consumed by the futile efforts of spammers trying
to make it through to people. (Once the futility becomes apparent
worldwide, the spamming model may naturally be a very unattractive
waste of time.)
(2) E-mail "inbox" clogging while the spammer profession lingers on,
before the futility of spamming has a chance to sink in worldwide.
(3) People with e-mail clients and services provided by giant corporations
may not experience the diminished spam until the giant corporations
have a chance to update software.
|
Other qualities of this proposal:
(1) Totally open technology; not "security through obscurity".
(2) Non-commercial, public-domain method, can be implemented by anyone
without consideration.
(3) Totally smooth transition from current e-mail clients, servers,
mailing list services, etc.
(4) Privacy preserved (no content analysis), and possibly even improved
(as proposed software becomes more widespread).
|
CORE CONCEPT
The following paragraphs describe the core concept of the method. Certain
details will be discussed in the "Use Cases" section:
Messages received by an e-mail client will be sorted by codes contained in
the message subject fields or within the message bodies. Spam messages
are extremely unlikely to contain the proper codes, and are thus diverted
to an anonymous-sender category.
Unlike an e-mail address alone, which is a single, unmoving target for
spammers, the additional codes are generated by formulae, and are tiny,
constantly-moving targets in a huge expanse of possible target locations.
Furthermore, any breach of trust can instantly be traced to specific
unscrupulous people, and immediately and conveniently patched. The
concept can be likened to "spread-spectrum" communication, or, much more
loosely, "port knocking".
|
CORE IMPLEMENTATION
The following paragraphs describe the core implementation of the method.
Three encrypted files are stored on an e-mail client machine:
(1) PRIMARY FORMULA TABLE:
Encrypted table with entries in the form:
( SHA hash of recipient e-mail address, primary formula )
(2) SECONDARY FORMULA TABLE:
Encrypted table with entries in the form:
( SHA hash of recipient e-mail address, secondary formula )
(3) CURRENT RECEPTOR TABLE:
Encrypted table with entries in the form:
( SHA hash of acceptable formula value, category, expiration policy )
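The three tables might be modeled as follows in Python. This is a sketch only: SHA-256 is assumed (the text says just "SHA"), the formula is kept as opaque text because its format is not specified, the expiration policy values are invented, and the encryption of the files at rest is not shown.

```python
import hashlib
from dataclasses import dataclass

def sha_hex(text):
    """SHA-256 is an assumption; the article specifies only "SHA"."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass
class FormulaEntry:           # row of the primary or secondary formula table
    recipient_hash: str       # SHA hash of the recipient's e-mail address
    formula: str              # opaque formula text (format not specified)

@dataclass
class ReceptorEntry:          # row of the current receptor table
    code_hash: str            # SHA hash of an acceptable formula value (code)
    category: str             # e.g. "Family", "Friends", "Business"
    expiration_policy: str    # hypothetical values: "never", "daily", "one-time"

primary = [FormulaEntry(sha_hex("friend@example.org"), "formula-A")]
receptors = [ReceptorEntry(sha_hex("KQZPLMX"), "Friends", "one-time")]
print(receptors[0].code_hash[:16])   # only hashes are stored, never the code
```

Storing hashes rather than addresses and codes means that even a decrypted table does not directly reveal who the contacts are.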
|
Basic operation:
Some time in advance of communication between a sender and receiver,
the sender acquires a formula from the receiver. The formula is
generated by the receiver and given to the sender by some "secure"
mechanism (which can be as casual as a face-to-face conversation,
phone call, postal mail, facsimile, or even conventional e-mail or
web page). The only point of keeping the formula secure is to
maintain the exclusivity of the specific communication channel
between sender and receiver. It is not that others can use the
formula to intercept messages, but others could use the formula to
get their messages to the receiver through the same moving portal.
The sender uses the formula to compute a code, and this code is
placed either in the subject line or in the body of the message,
and the message is sent by conventional e-mail to the receiver's
well-known e-mail address.
The receiver launches software to sort incoming mail. When the
software is launched, it requests a password. This password
decrypts the current receptor table. Incoming messages are
scanned for the special codes (which have easily-recognized
characteristics). If a message contains a code, the code is
hashed using the Secure Hash Algorithm (SHA). The hashed code
is compared with all acceptable hash codes in the decrypted
copy of the current receptor table residing only in memory.
If there is a match, the e-mail is accepted and placed into
the proper mail folder (Family, Friends, Business, Interest
Groups, etc).
If any entries in the current receptor table are set to expire, then
the primary formula table is decrypted in memory and all formulae
are evaluated, replacing all entries in the current receptor table.
This implements the "moving" aspect of the "moving target" nature
of the method. (Senders evaluate the appropriate formula upon
each transmission.)
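The basic operation can be sketched end to end in Python. Everything specific here is an assumption made for illustration: the "formula" is modeled as an HMAC of a shared secret and a period counter, codes are seven capital letters, and the easily-recognized form of a code is taken to be double square brackets. The article does not fix any of these details.

```python
import hashlib
import hmac
import re
import string

def evaluate_formula(secret, period):
    """Hypothetical formula: derive a seven-letter code from a shared secret
    and a period counter (for example, days since some agreed date)."""
    digest = hmac.new(secret.encode(), str(period).encode(), hashlib.sha256).digest()
    return "".join(string.ascii_uppercase[b % 26] for b in digest[:7])

def sha_hex(text):
    return hashlib.sha256(text.encode()).hexdigest()

CODE_PATTERN = re.compile(r"\[\[([A-Z]{7})\]\]")   # assumed marker: [[ABCDEFG]]

def sort_message(text, receptor_table):
    """Return the folder for a message; receptor_table maps code hash -> category."""
    for code in CODE_PATTERN.findall(text):
        category = receptor_table.get(sha_hex(code))
        if category is not None:
            return category
    return "Anonymous"

# Receiver side: rebuild the receptor table for the current period.
secret, period = "shared-with-alice", 19000
receptor_table = {sha_hex(evaluate_formula(secret, period)): "Friends"}

# Sender side: evaluate the same formula and embed the code in the subject.
subject = "Lunch on Friday? [[" + evaluate_formula(secret, period) + "]]"
print(sort_message(subject, receptor_table))                      # Friends
print(sort_message("Cheap pills, no code here", receptor_table))  # Anonymous
```

When the period counter advances, the receiver re-evaluates the formula and replaces the table entry, so a code captured during one period stops working in the next.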
|
USE CASES
This section describes examples of using the proposed spam control
method. Important qualities include: simplicity, convenience, reliability,
and minimal transition effort.
1. Personal, Regular Contact (Family, Friend, Colleague)
Sub-cases:
(1) Potential sender cannot or will not use the special software
used to implement this method:
The receiver (with the special software) can generate and
give out a fixed code (like a telephone number) for use by
the sender alone.
The sender simply puts this code somewhere in each e-mail.
If the code should leak beyond the friend or family member, the
code can be manually "retired" by the receiver, and a new
code can be given to the sender.
(2) Potential sender is also using the special software:
The receiver can give the potential sender a formula (which
is essentially a code used to generate codes), and the
potential sender enters the receiver e-mail address and the
formula into the primary formula table, and, possibly,
a second formula (and receiver e-mail address) in the
secondary formula table.
When the sender initiates the sending of an e-mail message
to the receiver, the special software evaluates the primary
formula specific to the receiver and places the appropriate
code into the outgoing message.
|
2. Mailing List Service
Sub-cases:
(1) Mailing list server cannot or will not use the special software
used to implement this method:
There is nothing to be done in this case. The mail arriving
from the mailing list server can only be classified according
to its claimed sender, or subject, or content.
(2) Mailing list server is also using the special software:
When subscribing to a mailing list service, the receiver
can give the mailing list service a formula (which
is essentially a code used to generate codes), along with
the receiver's well-known mailing address.
When the mailing list server initiates the sending of an
e-mail message to the receiver, the special software evaluates
the primary formula specific to the receiver and places the
appropriate code into the outgoing message.
*** Ideally, the mailing list server would expect codes
from subscribers, too, thus preventing non-subscribers
from spamming the list. ***
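The "slight modification" asked of mailing list services might look like the following Python sketch, shown here with a fixed code per subscriber for simplicity. The record layout, the bracket marker, and the function name are assumptions; a formula-based service would compute the code at send time instead of reading it from storage.

```python
# Hypothetical subscriber records: the list stores a code next to each address.
subscribers = [
    {"email": "reader1@example.org", "code": "KQZPLMX"},
    {"email": "reader2@example.org", "code": "TBWAYHD"},
]

def build_outgoing(subject, body):
    """Yield (recipient, subject, body) with each subscriber's own code appended."""
    for sub in subscribers:
        yield sub["email"], f"{subject} [[{sub['code']}]]", body

for recipient, subj, _ in build_outgoing("Weekly digest", "..."):
    print(recipient, "|", subj)
```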
|
3. Public Displays (Web Pages, Printed Matter, etc)
This is really not much different than case "1. Personal, Regular Contact
(Family, Friend, Colleague)".
For web pages, a server-side script (PHP, ASP, etc) can evaluate a
formula to display codes on the web content -- preferably as images.
Anonymous human visitors can read the codes and use them with any
e-mail client to contact the receiver associated with the web site.
The formula can be associated with a single page of the site, or
with the entire site. The formula can be re-evaluated on any
desired time-scale, thus frustrating anyone who takes the time to
manually enter the current code into a spammer database.
For printed content, a code can be printed along with the well-known
e-mail address. This code is specific to the printed campaign, much
like a URL for a movie promotion web site is only ever going to be
used for that specific movie promotion campaign. The code can be
retired after the promotion is over.
|
BREACHES OF SECURITY
This section describes various scenarios involving breaches of security.
(1) Trial-and-Error attack on specific user's e-mail address:
Apart from clogging the user's e-mail "inbox", the only possible
reward for the unlikely event of making it through the method's
barrier is that a single e-mail gets through. The attacker does
not learn anything after the attack, and only a similar effort
would have a chance of getting another e-mail through. The
probability of success per attempt is inversely proportional to the
number of possible codes, which grows exponentially with the length
of the codes generated by the formulae. For seven-letter codes there
are 26^7 = 8031810176 possibilities, so a single guess succeeds with
a chance of about 1 in 8 billion, roughly multiplied by the number of
active formulae, and an attacker would need about 8031810176/2, or
roughly 4 billion, attempts on average to hit one specific code.
Most e-mail "inboxes" would be clogged by 4 billion e-mail
messages, and such an attack is clearly an entirely different
problem from the spam problem.
(2) Finding a copy of a plain-text e-mail at any location accessible
through the Internet (even unauthorized or unanticipated access
via unprotected directories or accidental wireless exposure to
private drives):
This compromises a code. If the code is associated with a
formula-based channel (all parties have the special software),
nothing needs to be done; the code will expire, and it may
in fact be a one-time use code.
Worst-case outcome: A stranger sends e-mail that is not filtered
out by the receiver.
(3) Compromise of a system with a "mailing list" or "address book"
(in both cases this simply means the three files listed in the
CORE IMPLEMENTATION section):
Sub-cases:
(A) The files are corrupted or deleted:
The user of the software will know this has occurred.
One option is to restore the files from a recent
backup.
(B) A hacker runs a key-logger on the machine and acquires
the key for decrypting the primary formula table and
current receptor table -- and then copies those files
from the compromised machine:
This is a serious compromise. If this has been
suspected or detected, then a user can rid the
system of intrusion and then use a different key
to decrypt the secondary formula table and
command each recipient to replace entries for
the compromised sender and switch to the
secondary entries. New formulae can be exchanged
by methods that depend on how secure the parties
feel circumstances are at that point.
(4) Compromise of system relaying e-mail at any point in the Internet:
Sub-cases:
(A) Message content is encrypted: Compromise has no impact.
(B) Message content is not encrypted; codes can be extracted:
In this case a code has been compromised, in addition
to message content. If the sender has the special
software, the code may only work once, and the
snooping party will not be able to get through the
receiver's shield when the code expires (which can
be upon receipt of a message).
(5) Compromise of any archive or cache of e-mail messages
Same as case (4)
(6) Compromise of a desktop computer via executable (altered pirated software,
MS-Outlook script, prior e-mail attachment virus, IE exploits/Active X,
Windows/Linux exploits, etc):
This is a serious compromise. The hacker has free rein of one's
computer, including the ability to corrupt or delete the encrypted
files associated with the special software. The hacker can also
modify or replace the special software.
The hacker will only gain access to the decrypted content of the
primary and secondary formula files if the regular user of the
compromised system opens those files following the compromise
event.
Even if the hacker has access to the decrypted contents of those
files, the hacker will have to hash all e-mail contacts to determine
the correlation between contacts and formulae.
Finally, if the hacker really accomplishes all of this, the only
reward is the ability to avoid the e-mail filtering of one's
contacts.
The kind of compromise described here is far worse than the
problem of receiving unwanted spam!
(7) Unscrupulous business shares voluntarily offered formula to
"partners and affiliates":
The formula can be instantly removed from the receiver's table
at the first sign of abuse, and one can indicate the violation
of trust to the business.
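The one-time-code behaviour mentioned in cases (2) and (4), and the removal of an entry in case (7), can be sketched roughly as follows. This is only an illustration under assumptions: the class name `ReceiverShield`, its methods, and the sample codes are hypothetical and are not taken from the CORE IMPLEMENTATION section, and the formula that would actually generate the codes is not modelled here.

```python
class ReceiverShield:
    """Hypothetical receiver-side filter that accepts each code only once."""

    def __init__(self):
        # Maps a sender label to the set of codes still valid for that sender.
        self.valid_codes = {}

    def grant(self, sender, codes):
        """Register codes that the sender may use, one message per code."""
        self.valid_codes.setdefault(sender, set()).update(codes)

    def accept(self, sender, code):
        """Return True and expire the code if it is valid; otherwise False."""
        codes = self.valid_codes.get(sender, set())
        if code in codes:
            codes.remove(code)  # the code expires upon receipt
            return True
        return False

    def revoke(self, sender):
        """Drop every remaining code for a sender (compare case (7))."""
        self.valid_codes.pop(sender, None)


shield = ReceiverShield()
shield.grant("alice", {"k7Q2", "p9Z4"})
first = shield.accept("alice", "k7Q2")    # legitimate message uses the code
replay = shield.accept("alice", "k7Q2")   # a snooper replays the captured code
```

In this sketch the replayed code is rejected because it was consumed by the first message, which is the sense in which a captured one-time code loses its value.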
|
LONG-TERM IDEAS
Automatic encryption and decryption of message content at sender and
receiver prevents all intermediate routers, Mail Transfer Agents (MTAs),
subnets, etc., from doing content analysis or complete spying on messages.
A plug-in implementing RSA Public-Key cryptography for various mail clients
can be found at (http://www.rsa.org). Ideally, the special software
proposed in this article would have this capability.
If all e-mail contains the codes for the method described in the proposal,
and all e-mail is encrypted, then codes will enjoy longer usefulness in
keeping a channel reserved for certain parties to communicate, and content
will remain private.
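To make the idea of sender-side encryption and receiver-side decryption concrete, here is a toy sketch of textbook RSA. It assumes tiny primes, no padding, and byte-by-byte encryption, so it is not secure and is not the plug-in mentioned above; the function names are hypothetical, and real software would rely on a vetted cryptography library.

```python
# Toy textbook RSA with tiny primes: illustration only, NOT secure.

def make_keypair(p, q, e=17):
    """Build (public, private) keys from two primes p and q."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse of e (requires Python 3.8+)
    return (e, n), (d, n)

def encrypt(message, public_key):
    """Encrypt each byte separately with the receiver's public key."""
    e, n = public_key
    return [pow(b, e, n) for b in message.encode("utf-8")]

def decrypt(cipher, private_key):
    """Recover the text with the receiver's private key."""
    d, n = private_key
    return bytes(pow(c, d, n) for c in cipher).decode("utf-8")

public, private = make_keypair(61, 53)
cipher = encrypt("code: k7Q2", public)   # what a relay or archive would see
plain = decrypt(cipher, private)         # what the receiver reads
```

Only the holder of the private key can recover `plain`. A relay that copies `cipher` sees a list of integers rather than the message or the code it carries.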
CONCLUSION REGARDING PROPOSED METHOD
I did not describe the details of how the proposed system would work, but I
hope the proposal aspect of this article leads to more thinking about
solutions to spam -- especially about solutions that avoid invasion of
privacy by any form of content analysis or packet tracking, or cooperation
with specific corporations, or censorship.
CONCLUSION
Spam messages may be annoying, and may consume resources, but I strongly
disagree with laws punishing spam. I also strongly disagree with any
efforts to filter messages flowing through mail servers, or the practice
of blacklisting hosts or domains. None of these approaches will be
effective in the long term. Meanwhile, my proposed solution makes e-mail
as private as an individual desires, and does so without losing a single
message to an over-zealous filter or to corporate or government censors.
POSTSCRIPT -- RESPONSE TO SLASHDOT COMMENTS
I wrote this article, and submitted it to Slashdot, in an effort to increase
awareness of the technology of spam and the perils of various previously-
proposed solutions.
I acknowledge shortcomings with my own proposal, but I hope my article has
contributed something to the ongoing conversations about controlling spam.
After reading most of the Slashdot comments regarding my article, and
talking with a smart friend, I have been led to the following conclusions:
(1) The proposal in my article does not SOLVE a critical problem:
"How does an automated system differentiate between acceptable
attempts of strangers to make contact versus unacceptable
attempts of strangers to make contact?"
I did not have the clarity to think of the spam problem in terms
of this simple security puzzle when I wrote this article -- but
it's exciting now to think about this specific issue.
(2) The proposal in my article does not SOLVE another critical
problem: Consumption of resources. Specifically:
(a) Consumption of bandwidth all along the route from
spammer to destination mailbox;
(b) Consumption of disk space on intermediate and
final destination mail servers;
(c) Consumption of precious bandwidth between ISP and
person downloading e-mail (especially when using
dial-up connections);
|
I am considering these issues, along with the comments made on Slashdot.
The following Slashdot comments have supported or refuted various ideas
put forward in the article, and I am meditating on them:
Your post advocates a
( ) technical (*) legislative ( ) market-based ( ) vigilante
approach to fighting spam. Your idea will not work. Here is why it won't work.
(One or more of the following may apply to your particular idea, and it may
have other flaws which used to vary from state to state before a bad federal
law was passed.)
( ) Spammers can easily use it to harvest email addresses
( ) Mailing lists and other legitimate email uses would be affected
( ) No one will be able to find the guy or collect the money
( ) It is defenseless against brute force attacks
( ) It will stop spam for two weeks and then we'll be stuck with it
(*) Users of email will not put up with it
( ) Microsoft will not put up with it
( ) The police will not put up with it
( ) Requires too much cooperation from spammers
( ) Requires immediate total cooperation from everybody at once
( ) Many email users cannot afford to lose business or alienate potential
employers
( ) Spammers don't care about invalid addresses in their lists
( ) Anyone could anonymously destroy anyone else's career or business
Specifically, your plan fails to account for
( ) Laws expressly prohibiting it
( ) Lack of centrally controlling authority for email
( ) Open relays in foreign countries
( ) Ease of searching tiny alphanumeric address space of all email addresses
( ) Asshats
(*) Jurisdictional problems
( ) Unpopularity of weird new taxes
( ) Public reluctance to accept weird new forms of money
( ) Huge existing software investment in SMTP
( ) Susceptibility of protocols other than SMTP to attack
( ) Willingness of users to install OS patches received by email
( ) Armies of worm riddled broadband-connected Windows boxes
( ) Eternal arms race involved in all filtering approaches
( ) Extreme profitability of spam
(*) Joe jobs and/or identity theft
( ) Technically illiterate politicians
( ) Extreme stupidity on the part of people who do business with spammers
( ) Dishonesty on the part of spammers themselves
( ) Bandwidth costs that are unaffected by client filtering
( ) Outlook
and the following philosophical objections may also apply:
( ) Ideas similar to yours are easy to come up with, yet none
have ever been shown practical
( ) Any scheme based on opt-out is unacceptable
( ) SMTP headers should not be the subject of legislation
( ) Blacklists suck
( ) Whitelists suck
( ) We should be able to talk about Viagra without being censored
( ) Countermeasures should not involve wire fraud or credit card fraud
( ) Countermeasures should not involve sabotage of public networks
( ) Countermeasures must work if phased in gradually
( ) Sending email should be free
( ) Why should we have to trust you and your servers?
( ) Incompatibility with open source or open source licenses
( ) Feel-good measures do nothing to solve the problem
( ) Temporary/one-time email addresses are cumbersome
( ) I don't want the government reading my email
(*) Killing them that way is not slow and painful enough
Furthermore, this is what I think about you:
(*) Sorry dude, but I don't think it would work.
( ) This is a stupid idea, and you're a stupid person for suggesting it.
( ) Nice try, assh0le! I'm going to find out where you live and burn
your house down!
|
I don't know what spam data you used, but I've noticed quite a few spams
getting through my bayesian filter lately... they all have more random
words in sentences at the bottom than the real message at top. They do
it like 'hank urged me and I to send you this flower and important notice'
Bad grammar but I'm sure it's meant to look like a 'real' sentence since
the computer can't 'read' like a person. It's kinda like an adlib game...
they make a list of several hundred sentences with verbs and/or nouns
missing then use word lists to fill them in.
|
we're using GFI MailEssentials at work, which uses bayesian filtering.
all the emails that have a random string of words aren't getting through
the filter. the ones that are usually contain, e.g. 1 normal sentence,
and one link. the problem is, we aren't getting the same one over and
over, it only comes in once, so it's hard to train the filter to block it.
but hey, what do i know?
|
"I use bogofilter and have a corpus of 20k spam messages, I always rescore
misfiltered spam, and I still get messages that slip through the filter.
Almost all are messages with a ton of random garbage appended to the
message, and one spammer was actually putting whole passages from some
book about Abe Lincoln in the messages.
Jamming the message with non-spam words works too well around here."
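The padding effect described in the comments above can be sketched with a simplified naive-Bayes style score. The word probabilities below are invented for illustration, and the combination rule (summed log-odds over every word) is an assumption. It is not the algorithm used by bogofilter, GFI MailEssentials, or any other specific filter.

```python
import math

# Hypothetical per-word spam probabilities learned from past mail.
WORD_SPAM_PROB = {
    "viagra": 0.99, "pills": 0.95, "cheap": 0.90,
    "meeting": 0.05, "baseball": 0.10, "lincoln": 0.05, "subaru": 0.10,
}

def spam_probability(text, unknown=0.5):
    """Combine per-word probabilities naive-Bayes style via summed log-odds."""
    log_odds = 0.0
    for word in text.lower().split():
        p = WORD_SPAM_PROB.get(word, unknown)  # unseen words are neutral
        log_odds += math.log(p) - math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(-log_odds))

pitch = "cheap viagra pills"
padded = pitch + " meeting baseball lincoln subaru"
score_pitch = spam_probability(pitch)     # close to 1: clearly spam
score_padded = spam_probability(padded)   # dragged below 0.5 by the padding
```

With these made-up numbers, four innocent words are enough to pull the combined score under an even threshold, which matches the commenters' experience that appended "non-spam" text lets messages slip through.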
|
It's dangerously bad. If email messages accurately identified where they
came from, and if spammers didn't maliciously forge addresses of people
they want to harass, and if spammers didn't usually abuse free email
systems and free web pages or forge purely bogus sender addresses
(usually also at free email systems), then that would be a fine idea.
Many spammers also frequently put other people's valid URLs in their mail
to fake legitimacy, e.g. URLs from E-Bay's news site or the Better
Business Bureau or various anti-virus companies, in addition to having
their own URL for the suckers to click.
|
Strange, when I create new e-mail accounts, like my work account, I
generally have a couple messages of spam waiting for me already.
|
That worked for me until I emailed a customer feedback comment to a
somewhat large corporation which makes a product I really like. I also got
a satisfactory reply from their customer representative.
A few months later, that *expletive* customer representative forwards
one of those stupid urban myth chain-letters (about some missing kid/fake
amber alert), using that company's email address book, which included my
email address!
Then the spam deluge started. :(
|
I used to think that would work, then I went to college and had spam in my
inbox before I had even first checked it. I hadn't given out the address
to anyone, not even my close friends.
|
Is it just me, or has recent spam flavor included random sentences
(not just random word lists) that are meant to sound like a plausible
person is on the other end?
Then, embedding some link to spam inside, in an attempt to get the S/N
filters to let it pass?
|
This is what I got:
Jen, searching for a site to purchase medication?
Character is that which reveals moral purpose, exposing the class of
things a man chooses or avoids.
Those who aim at great deeds must also suffer greatly.
Let your imagination release your imprisoned possibilities.
We are able to ship worldwide
Be thrifty, but not covetous.
Go here and get it
You are totally anonymous!
I confess I enjoy democracy immensely. It is incomparably idiotic, and
hence incomparably amusing.
Epigrams succeed where epics fail.
The only line I deleted was the one with the url... now tell me what
this spam message was trying to sell!
|
I've been getting a lot of spam recently that have images of the spam text
itself, and include a "post-modern story of the day" at the bottom in
plain text in order to trick/reduce effectiveness of Bayesian filters.
Or some will do the same and just have a collection of random words
(like "baseball sports espn issue meeting car subaru" etc).
|
It should be self-evident that this solution is not workable.
Anything that requires this massive type of retooling of the whole method
of using e-mail is doomed to failure.
Any proposed solution cannot cause this type of massive interruption of
normal e-mail usage.
|
"It should be self-evident that this solution is not workable.
Anything that requires this massive type of retooling of the whole method
of using e-mail is doomed to failure."
This attitude is what keeps real solutions from occurring. SMTP/POP3 is
antiquated, designed for a simpler time, and it needs to be replaced, period.
If there were anything in its standards that could truly prevent spam,
don't you think someone would have come up with it in the last 15 years?
And so what if we have "interruption of normal e-mail usage" for a while?
What do you think we have now? Millions of tiny "interruptions" bouncing
around 24 hours a day. Slowing things down, wasting resources, wasting
time, etc.
These band-aid fixes are just that. They are not a solution. So I don't
have to see the bestiality or xanax ads anymore, great. That doesn't
mean they aren't still consuming mass resources in their continuous
effort to reach me.
"retooling of the whole method of using e-mail" is exactly what needs
to happen, and not just because of the spam epidemic.
|
Although I don't think this article has the right solution, I don't see
a problem with redesigning the email method.
If a "spam-free" email exists in parallel with the email as we have it
now, I will divert the spam-free mail to my inbox, and the spammy mail,
through a filter, to a junk-suspect folder to be checked once a week.
Of course, this spammy mail will get an auto-reply that tells the sender
how to contact me using the spam-free protocol. After a while, I am
certain the people I really want to hear from will all use the
spam-free protocol, and I will stop checking the regular email, after
changing the auto-reply to "your mail has just been ignored".
The key to get a massive retooling accepted is by using the original one
in parallel. It will die off soon enough.
|
I don't care about the retooling. The core problem with this idea is the
concept of handing out codes. Retraining users will be a pain in the ass.
Not to mention just tracking the codes would suck. Say I meet someone in
real life. How do I have a code to give them? Do I have to have a
business card with codes on it? Do I need a computer just to exchange an
email address?
If you want to use codes or signatures, it has to be server side. That
way a server could sign an email as valid. Then my email client could
take that signature under advisement when the email actually arrives.
The difference in action is an exercise left to the reader. However it's
a lot nicer in the ease of use hurdles and doesn't screw old setups.
|
I wouldn't say he reinvented the whitelist. The whitelist is based on
resending an e-mail after it's bounced back to the sender because it's
an unrecognized e-mail address. This technique relies on something that's
similar to public/private keys, with a dynamic code that helps detect
true users from automated ones.
My main gripe (that I just realized) is that some e-mail must be sent
automatically, like web server confirmations. They would get sent into
your "other" inbox with the thousands of spam messages if you lacked the
person's "code".
|
Personally I really liked D. J. Bernstein's (qmail, djbdns, daemontools)
idea for a new mail protocol. The big difference between it and mail we
have now is that only the notification of mail is sent, not the mail
itself. The mail sits on the sender's mail server, waiting to be picked
up, and if you want to retrieve it, your mail client does so from his
server. Think about it - No more anonymous spam, since you KNOW where
messages are coming from if you have to retrieve them. Therefore, if
spam is illegal, we can punish them... and there is no more faking of
where it's coming from.
The other cool concept to that is mailing lists vs bandwidth. In old
mailing list styles, a message would go out to the list, bouncing back
from all people whose boxes are gone or full, with a lot of traffic.
In DJs new way, there is only notification of the message sent, and
then only those who really want the message download it.
The more you think about it, the better of an idea it becomes. In the
world of terrifying ideas like "postage for emails" or "really super-mega
-expensive domain names for mail only" Bernstein's has an elegance and
practicality I haven't seen elsewhere.
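The notification-only model described in this comment can be sketched as an in-memory toy. The class `SenderServer`, its methods, the notification fields, and the example addresses are all hypothetical. This is not Bernstein's actual protocol design, only an illustration of "store at the sender, notify, then pull".

```python
class SenderServer:
    """Hypothetical sender-side store: mail stays here until it is fetched."""

    def __init__(self, host):
        self.host = host
        self.stored = {}  # message id -> message body

    def submit(self, recipient, body):
        """Store the body locally and return only a small notification."""
        message_id = len(self.stored) + 1
        self.stored[message_id] = body
        return {"to": recipient, "host": self.host, "id": message_id}

    def fetch(self, message_id):
        """Hand over the body when the recipient decides to retrieve it."""
        return self.stored.get(message_id)


server = SenderServer("mail.example.org")
notice = server.submit("jen@example.net", "Minutes of Tuesday's meeting ...")
body = server.fetch(notice["id"])  # pulled only if the recipient wants it
```

The recipient must contact `notice["host"]` to obtain the body, so the storage cost stays with the sender and the retrieval point is known. The sketch does not address the objections raised in the next comment.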
|
The big difference between it and mail we have now is that only the
notification of mail is sent, not the mail itself.
Options:
a) Notification contains no sender-modifiable content. No way to know
if you want it or not. You say yes and wind up with spam from
unknown server.
b) Notification winds up containing the entire spam as subject line,
and the supposed server it's coming from doesn't exist.
c) Spammers break into millions of unsecured Windows boxes and run
'mail servers' on them.
Nice try, but no cigar.
|
I administer a mail server for a small ISP. The problem with filtering on
the user's end is that my costs are consumed by the time the user deals
with the spam. I don't think, as the article suggests, that spammers will
slow down if their message is not being read, in fact they will just spew
out ever more spam. If a 1/10 of 1% hit rate does not deter them, a
smaller hit rate won't either.
I have to put some upper limit to the amount of storage I can give each
person (right now I allow 100M, which I think is quite reasonable). But
if a user goes on vacation and does not check their e-mail for a month,
they could have their inbox filled with spam and viruses (not much
difference these days, from a server admin point of view). This will
prevent legitimate messages from coming through. Therefore, I use the
following technical measures to help reduce spam:
RBLs: dnsbl.njabl.org, sbl.spamhaus.org, xbl.spamhaus.org,
and dul.dnsbl.sorbs.net
SPF:Sender (not adopted widely yet, but it does block a few messages
a day even now)
Blocking specific subject lines (during virus outbreaks this can help)
Blocking mail "from" non-existent domains
I really have no choice, I cannot afford not to take these measures.
I explain all of them to my clients, nobody has had a problem yet.
These measures catch roughly 75% of spam and viruses, and as far as
I know, no false positives.
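For readers unfamiliar with the RBLs listed in this comment, a DNS-based blocklist is normally queried by reversing the octets of the connecting IPv4 address and looking that name up under the list's zone. The sketch below assumes that convention. The helper names are hypothetical, the address is a documentation example, and treating any answer as "listed" is a simplification, since some lists return special codes for errors or different listing categories.

```python
import ipaddress
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNS name used to look up an IPv4 address in a DNSBL zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip, zone):
    """Return True if the zone answers for this address (needs network access)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False  # no record found: the address is not listed

name = dnsbl_query_name("192.0.2.99", "sbl.spamhaus.org")
```

A mail server would typically run such a check at connection time and refuse the message before accepting its body, which is why this approach saves the storage and bandwidth that user-side filtering cannot.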
|
A portion of my e-mail "Inbox" on 2004 March 29th as manifested by the
"Microsoft Outlook Express 5" application. On this date I received 9
"legitimate" messages, 77 spam messages, and 2 virus attachments.
And later:
cpfahey@earthlink.net
Outlook Express, public e-mail address and he is complaining about spam.
Surprise, surprise!
|
|
CONTACT INFORMATION
Colin P. Fahey
Irvine, California; USA
cpfahey@earthlink.net
http://www.colinfahey.com
|
|