ABSTRACT

The term "spam", in the context of electronic communication, is typically understood to mean:

Spam: An unsolicited message sent to a large number of recipients.


It is commonly believed that a spam message for a given product or service sent to thousands or millions of people will generate very few actual customers for the sender of the message. Even if the ratio of attracted customers to total spam message recipients is only 1/10000, or even much lower, the overhead costs of spamming are so low that a net profit is possible. In fact, for many spam products, selling a single product, or scamming a single victim, may be the break-even point for the business model. Several factors contribute to the perception of the spam phenomenon as a pressing social and technological problem:

(1) Spam wastes millions of hours of humanity's time each day on the task of differentiating between spam messages and "legitimate" messages;
(2) Spam consumes a significant fraction of total Internet bandwidth, which slows other traffic and possibly raises overall bandwidth cost;
(3) Spam consumes a large amount of storage space on mail servers, sometimes making it temporarily impossible for "legitimate" messages to be received;
(4) Spam can be the vehicle of "identity theft" campaigns, other types of fraud, or virus propagation.


This article describes the flaws with many of the simplistic methods proposed or employed in past attempts to reduce the levels of the factors listed above. A more robust solution is proposed.




The Original SPAM!

SPAM is a pork and ham food product produced by Hormel Foods (http://www.hormel.com):

"SPAM Classic is a conveniently packaged canned meat product made of 100 percent pure pork and ham. SPAM Classic contains 180 calories per two-ounce serving. SPAM luncheon meat first was produced in 1937. It was one of the first convenient, moderately priced and great tasting meat products on the market."


Hormel Foods has written an article commenting on the use of the term "spam" to refer to unsolicited e-mail: http://www.spam.com/ci/ci_in.htm From the article:

Ultimately, we are trying to avoid the day when the consuming public asks, "Why would Hormel Foods name its product after junk e-mail?"


Briefly, the use of the word "spam" to mean "driving insane with relentless, monotonous bombardment" is directly attributed to a "Monty Python's Flying Circus" (humorous BBC television series) skit that essentially celebrates SPAM. A restaurant patron discovers, with chagrin, that everything on the menu contains SPAM -- such as: "[...] spam spam spam egg and spam; spam spam spam spam spam spam baked beans spam spam spam...". The mention of spam rouses Viking restaurant patrons to break into song: "Spam, spam, spam, spam. Lovely SPAM, wonderful SPAM!", driving the frustrated restaurant patron mad.
A detailed article on the "Origin of the term 'spam' to mean [network] abuse" was written by Brad Templeton: http://www.templetons.com/brad/spamterm.html




REASON FOR THIS ARTICLE

I am very concerned by various "solutions" to the spam phenomenon that involve:

(1) Invasion of Privacy; (2) Censorship; (3) Payment and Cooperation with a Commercial Entity.


The very day I started writing this article (2004 March 29th) I heard a report on the BBC World Service (rebroadcast on a local public radio station) featuring an interview with a person affiliated with a company that will offer a "new kind of spam filtering" on a paid membership basis. The method relies on monitoring Internet traffic, searching for "identical" e-mail messages coming from a common source. Suspected spam e-mail is analyzed further to discover any links to sites previously associated with spam efforts. This service will fail due to various scenarios described in this article.

However, my biggest concern when hearing proposals like this is that the public will embrace the proposed solution without fully considering issues of invasion of privacy, censorship, and corporate interests. Clients of various online services are subject to the contracts of the service providers. I have no complaint with that because I can choose to avoid service providers with terms I do not like. My concern is that the current conversations about solutions to the spam phenomenon will lead to a wide acceptance of terms that go against principles I consider important. I believe a significant fraction of the people who would willingly accept such terms might not be so accepting if the impact of such terms on privacy, freedom of communication, and freedom from the influence of corporate interests were described in a way that makes the issues very relevant and personal.






GALLERY OF SPAM

This section presents contemporary examples of spam, with some analysis and related information. Although this section is based on spam I have personally received, I believe my experience is typical of users of e-mail. This section is intended to sketch the basic principles of spam. An attempt at a formal definition of the term "spam" will be postponed until the next section. Presenting examples in this section will make subsequent formal discussion less abstract.




1. My E-mail "Inbox"

Over the past few months I have received an average of roughly 100 spam messages per day, and I generally receive several viruses as e-mail attachments each day. Earlier this year, from 2004 January 15th through 2004 February 8th (25 days), I received 2872 spam messages, of which 207 were viruses. This corresponds to an average of roughly 115 spam messages per day, and an average of 8 virus attachments per day.



FIGURE: A portion of my e-mail "Inbox" on 2004 March 29th as manifested by the "Microsoft Outlook Express 5" application. On this date I received 9 "legitimate" messages, 77 spam messages, and 2 virus attachments.

SENDER NAME AND SUBJECT:
========================

One of the striking features of most spam messages is that the disingenuousness starts almost immediately with the alleged sender's name. The fact that almost every spam message has a fake sender name cheapens the whole concept of the sender name. Of course that is just the beginning of the erosion of trust, but I nonetheless pause and consider the bizarre act of a spammer producing a fake sender name. Spam messages promoting "male sexual performance" drugs and pornography often have sender names that are female. Interestingly, the subject associated with a spam message often really does contain an accurate summary of the spam message. But, as one can see in the small set of subject items above, some spammers believe it is possible to do without sensible descriptions of e-mail content. Eventually both the sender name and subject line will be recognized by the public at large as totally meaningless claims associated with the messages, which is a reflection of the actual technical fact: these fields are totally unreliable for determining the origin and content of e-mail messages.



WHAT IS ALL THE SPAM ABOUT?
===========================

I did a simple analysis of the spam messages received on three recent dates:

(1) 2004 March 29th : 77 spam messages total;
(2) 2004 March 30th : 98 spam messages total;
(3) 2004 March 31st : 121 spam messages total.


The following is a rough classification of the messages received:

MEDICATION:
-------------------------------------------------------------------
PENIS-ENLARGEMENT:
  Viagra, Cialis, NaturalGain,
  "Weekend Pill", Viagra Patch:           18/77, 17/98, 16/121
ALTERNATIVE-SOURCE PRESCRIPTION
MEDICATIONS/PSYCHOTROPIC DRUGS:
  Levitra, Phentermine, Vicodin, Valium,
  Ambien, Xanax, Tramadol, Lipitor,
  Propecia, Zocor:                        14/77, 18/98, 19/121
  Marijuana-like product/
  Mood Enhancers/Herbal Meds:              1/77,  0/98,  0/121
DIET/NUTRITION:
  Diet Pills/Patch:                        3/77,  3/98,  3/121
  Anti-Aging/HGH:                          1/77,  0/98,  1/121
SMOKING:
  Cigarettes:                              1/77,  1/98,  3/121
HEALTH AID:
  Snoring Control:                         1/77,  0/98,  0/121
-------------------------------------------------------------------
TOTAL:                     39/77(50%), 39/98(40%), 42/121(35%)


FINANCIAL:
-------------------------------------------------------------------
LOANS/CREDIT:
  Refinance Mortgage/Equity Loan:         13/77, 12/98, 11/121
  "Cancel Debt" (somehow):                 0/77,  1/98,  8/121
  Car Loans:                               0/77,  2/98,  1/121
  Payday Cash Advance:                     1/77,  1/98,  0/121
  Unsecured MasterCard/Credit:             1/77,  0/98,  1/121
INVESTING:
  Investor/Stock Alert:                    5/77,  5/98,  3/121
INSURANCE:
  Life Insurance:                          1/77,  1/98,  2/121
  Healthcare:                              1/77,  0/98,  0/121
  Auto/Warranties:                         1/77,  0/98,  0/121
BUSINESS OPPORTUNITIES:
  "Work" on eBay:                          1/77,  6/98,  4/121
  Own Resort:                              1/77,  0/98,  0/121
  "Network Marketing":                     0/77,  0/98,  1/121
  Real-Estate Auctions:                    0/77,  0/98,  1/121
GAMBLING:
  Poker/"Earn Money Playing Lotto!":       0/77,  1/98,  2/121
SPAMMING:
  Spam 27 million people:                  0/77,  1/98,  0/121
-------------------------------------------------------------------
TOTAL:                     25/77(32%), 30/98(31%), 34/121(28%)


SOFTWARE/CONTENT:
-------------------------------------------------------------------
PORNOGRAPHY:
  Porn (farm sex, schoolgirls, girls
  gushing, web cam, monster cocks):        1/77,  1/98,  6/121
PARANOIA/SNOOPING:
  Software to Learn about People:          1/77,  0/98,  0/121
  Scan PC:                                 1/77,  0/98,  0/121
  Keyboard Logger:                         0/77,  1/98,  0/121
PIRACY:
  Cheap software/OS:                       2/77,  8/98,  5/121
  DVD copying:                             0/77,  2/98,  0/121
  Cable Descrambling/
  Free "Pay-Per-View"(!):                  0/77,  2/98,  0/121
-------------------------------------------------------------------
TOTAL:                       5/77(6%), 14/98(14%), 11/121(9%)


MALICIOUS/FRAUD:
-------------------------------------------------------------------
VIRUS:
  Virus (Mail "Delivery Failed" type
  with attachment):                        2/77,  0/98,  1/121
IDENTITY THEFT:
  Web-based "verification"
  (PayPal,eBay,Fleet Bank):                2/77,  2/98,  0/121
-------------------------------------------------------------------
TOTAL:                          4/77(5%), 2/98(2%), 1/121(1%)


MISCELLANEOUS:
-------------------------------------------------------------------
  Unknown:                                 2/77,  6/98, 18/121
  Blind date/dating:                       0/77,  0/98,  5/121
  Earn Degree/Degree without Tests:        0/77,  1/98,  3/121
  "Colin, Grow 2 Cup Sizes -- FREE!",
  Bigger Breast From Pill:                 0/77,  1/98,  2/121
  Vacation Deals:                          1/77,  1/98,  0/121
  Your Opinions might make you 1000:       0/77,  1/98,  1/121
  Hair Transplants:                        0/77,  1/98,  1/121
  Misc. Deals:                             1/77,  0/98,  0/121
  Luxury Sheets:                           0/77,  1/98,  0/121
  Free Samsung Mobile Phone:               0/77,  1/98,  0/121
  Hypnotic MP3 for Depression,
  Self-Esteem, Motivation:                 0/77,  0/98,  1/121
  Wristwatches (Rolex,etc):                0/77,  0/98,  1/121
  Print Own Postage:                       0/77,  0/98,  1/121
-------------------------------------------------------------------
TOTAL:                      4/77(5%), 13/98(13%), 33/121(27%)


SUMMARY:
-----------------------------------------------------------------------
MEDICATION TOTAL:         39/77( 50% ), 39/98( 40% ),  42/121( 35% )
FINANCIAL TOTAL:          25/77( 32% ), 30/98( 31% ),  34/121( 28% )
SOFTWARE/CONTENT TOTAL:    5/77(  6% ), 14/98( 14% ),  11/121(  9% )
MALICIOUS/FRAUD TOTAL:     4/77(  5% ),  2/98(  2% ),   1/121(  1% )
MISCELLANEOUS TOTAL:       4/77(  5% ), 13/98( 13% ),  33/121( 27% )
-----------------------------------------------------------------------
TOTAL:                    77/77(100%*), 98/98(100%*), 121/121(100%*)

(*...Percentages in this table are rounded and do not add to 100% with shown precision.)


ANALYSIS:
=========

Medication is the most frequent topic of spam messages during this three-day sample. Two types of medication supply services dominate in this category of spam messages: (1) Penis enlargement; (2) General pharmacy "needs" (often drugs that are expensive in the domestic US market, and drugs which reputable doctors may be hesitant to prescribe due to lack of medical justification and potential for abuse). Spam messages promoting penis-enlarging drugs are typically very informal, using phrases like: "Haha, U Have A Real Small Pe-nis", "Is Your Me.mber too Teeny?", "Screw ur lover like never before", etc.

Financial topics were very common among the spam messages during this three-day sample. Home mortgage loans and refinancing offers dominate this category of spam messages. Investor "stock alerts" are also common. During this period, the "making a fortune on eBay" plan was significantly promoted. My personal favorite scam concept in this category arrived with the subject: "Earn Money Playing Lotto!".

Software and media content are popular spam topics. Offers of inexpensive software dominate this category; there is no doubt that this software is pirated, despite explanations of how, for example, one can buy Windows XP for $32 USD instead of paying $286 USD. Spam promoting pornographic web sites is also common in this category. My personal favorite offer is for a product that will give a person "Free [Pay-Per-View]" -- an oxymoron if one doesn't consider the fact that the product itself actually costs money. Another really interesting sub-category in spam regarding software products is software designed to address a person's paranoia -- such as software to scan a person's PC for spyware, or software to spy on children and spouses using the family computer, or software to learn about public records on others (or oneself!). The irony is that installing such software will lead to the very things the target spam recipients fear most.

Of the miscellaneous topics of other spam messages, alleged "blind dates" are frequent, along with offers to earn various degrees (often just by paying a small fee; no testing or qualifications necessary!). My personal favorite is an offer with the subject: "Colin, Grow 2 Cup Sizes -- FREE!".




2. Notable Spam from the Years 2001-2003

The following images were taken from spam messages I received during the period 2001-2003.



FIGURE: I received this spam message on 2001 September 21st, 10 days after the World Trade Center buildings were destroyed by fires following plane crashes directly caused by terrorists. This spam message, offering, among other things, a bumper sticker that advocates a plan to "Nuke Afghanistan", demonstrates that spam can be very political. Following the US initiation of the war on Iraq in 2003, spam offering "Terrorist 'Most-Wanted'" playing cards, depicting 52 people targeted by the US anti-terrorism effort, arrived almost daily in my e-mail "Inbox" for many months. It is important to consider politically-motivated spam, or spam efforts that seek to profit on hate.



FIGURE: This creepy spam message, like most spam messages, addresses some sort of personal insecurity. This same product, in a funny coincidence, was also promoted as a way to spy on naked women -- essentially advocating a means to violate someone else's personal security.



FIGURE: This spam message, which I received some time in the year 2002, makes an indirect reference to the film "The Matrix": [Morpheus offers Neo a choice between two pills, one blue and one red.] "You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe." "You take the red pill...and I show you how deep the rabbit hole goes." Although in the film it is the red pill (not the blue pill) that will result in being shown "how deep the rabbit hole goes", the humor of this Viagra spam message is hardly diminished. Compared to penis-enlargement product spam messages of 2003 and 2004, this spam message is fine art!



FIGURE: This spam message, which I received some time in the year 2002, is the most outrageous invitation for irony I have ever seen. The idea of promoting a sense of security by installing an application that is completely invisible and secretly records all instant messages, chat, e-mail, web sites, etc, is perverse. Naming the application "IamBigBrother" is hilarious!



FIGURE: Apart from the fact that the product one might receive after responding to this offer is likely to actually contain viruses rather than prevent them, I chuckle at the hypocrisy of a spam message that proclaims: "Norton Spam Alert filters unwanted e-mail!"





3. Non-technical spam examples from 2004

The following images were taken from spam messages I received during the year 2004, illustrating miscellaneous non-technical points.



FIGURE: This spam message is, apparently, an invitation to start a career in sending spam messages! (Of course visiting the specified web site address could lead to viruses, such as spyware or a trojan mailer.) I can't help it: I love the bravado of the author of this message! My sober judgment says this person's position is very wrong, but there is something very human about this message that resonates with me.



FIGURE: This spam message promotes a service to allow one to send spam messages to "27 million people". I suppose receiving this message itself lends some (i.e., 1/27000000th) credibility to the sender's spamming capability.



FIGURE: I like the domain name: "YetAnotherDomainName.com" (created 2004 January 29th, and resolving to 216.177.88.181 at the time of this writing, 2004 April 3rd.) This domain name actually captures the spirit in the spammer realm, where purchasing new throw-away domain names to launch the next spam campaign is a small price to pay to avoid spam obstacles, like IP blacklisting or cancellation of service by the web hosting provider (who discovers, too late, that a host was rented for use in a spam effort).



FIGURE: I like this one; just sign up and start making money -- while doing "Absolutely Nothing!". Scams are based on greed, and this example is one of the purest appeals to greed I've ever seen.



FIGURE: This spam message promotes a book whose only possible positive use is by law-enforcement officers trying to keep up with criminals. I guess I might find such a book intellectually interesting, much like the articles in "2600" magazine, and I support being able to freely share any ideas whatsoever, but these facts do not make this book, and others like it, any less repulsive to my ethical sensibilities. How lonely and hostile a world it must be for a person to seriously consider the subjects in such publications. Some of the items seem somewhat contradictory: "Surf Internet Anonymously" while also demonstrating that one can "Hack into other [people's] computers remotely"; or "How to get fake identity documents" while also demonstrating "tracing and tracking people [...] to find those who don't want to be found".



FIGURE: This spam message teases me with one of the central mysteries of spam: How can anyone buy medication (which enters and affects one's very body) from a business that thinks it is okay to use a fake sender name, use a humiliating taunt as a subject, make numerous spelling errors in the promotion, and end with a collection of random words? (If a person is savvy enough to recognize the fact that the misspellings and random words are methods of evading spam filters, all the more reason to mistrust the business attempting to sell the medication!) But, on the other hand, I am baffled by the state of mind of the spammers who send messages like these. What world-view makes it okay to send messages like these? (NOTE: I strongly agree with the principle of free communication, but I still wonder why people choose to express certain ideas.)



4. "Identity Theft" examples from 2004

The following images were taken from spam messages I received during the year 2004, illustrating "identity theft" efforts. The basic concept is to convince the message recipient that it is necessary to gather personal information, often to "prevent an account from expiring", or for the recipient's "security" and "protection".



FIGURE: This spam message is admirable in its professional appearance and for its outrageous inclusion of phone numbers and web site addresses to help the victim gather information to be scammed more efficiently and completely.



FIGURE: The PayPal scam e-mail message in the previous figure includes the JavaScript code directly above -- which repeatedly writes the text "http://www.paypal.com" to the browser status bar (lower-left border of Internet Explorer, for example). Thus, when the user hovers the mouse over the critical links in this spam message, the actual link (which would be a potential give-away that this is a scam) is quickly clobbered by the text "http://www.paypal.com". Only someone watching the status bar intently while moving the mouse on and off the various links (without clicking) would see the brief flash of the real target web site address.



FIGURE: This "identity theft" scam, allegedly from Citibank, which I received on 2004 April 4th, makes a direct request for a debit-card PIN (Personal Identification Number), which can be used to withdraw cash. This must be a psychology experiment, or a massive IQ test, or perhaps the "Spam & Bologna" contest (in progress at the time of this writing).



FIGURE: The "Citibank" scam spam message in the previous figure has the HTML code shown directly above.



FIGURE: This eBay scam isn't as elaborate as the PayPal scam above, but it probably looks sufficiently professional to be effective.



5. Virus attachment examples from 2004

The following images were taken from spam messages I received during the year 2004, illustrating virus attachments. If I had a spare computer I'd be tempted to download as many viruses as possible and have all the viruses fight it out for control of the computer's resources. "Ready...FIGHT!"



FIGURE: I have to say: This e-mail with virus attachment is a bit of a contemporary classic. I've never actually tried the virus out, you know, to see if it fit my lifestyle, but, hey, "500.000" people can't be wrong!



FIGURE: Wow, this virus attachment has quite a background story!



6. Trivial obfuscation example from 2004

The following images were taken from spam messages I received during the year 2004, illustrating use of trivial obfuscation.



FIGURE: This is a trivial form of obfuscating the content of HTML to thwart text-based e-mail message content filters. The dummy HTML tags break up the text that will ultimately appear in the HTML document. The solution to this direct problem is to eliminate or consider the effect of HTML tags before scanning for spam-indicating words, but this is just the beginning of the hopelessness of content scanning, as will become more evident in the subsequent sections of this article.
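The "eliminate HTML tags before scanning" step mentioned above can be sketched in a few lines of C. This is a minimal illustration, not a real HTML parser: the function name strip_tags is hypothetical, and the sketch assumes that everything between '<' and '>' is a tag, ignoring comments, character entities, and a '>' inside a quoted attribute value.

```c
#include <stddef.h>

/* Copy src into dst, dropping everything between '<' and '>'.
 * dst must have room for strlen(src) + 1 bytes.
 * Deliberately naive: comments, entities, and quoted '>' characters
 * inside attribute values are not handled. */
void strip_tags(const char *src, char *dst)
{
    int in_tag = 0;  /* nonzero while we are inside <...> */

    for (; *src != '\0'; src++) {
        if (*src == '<') {
            in_tag = 1;
        } else if (*src == '>' && in_tag) {
            in_tag = 0;
        } else if (!in_tag) {
            *dst++ = *src;  /* visible text: keep it */
        }
    }
    *dst = '\0';
}
```

Given the input "V<b></b>i<xyz>ag</xyz>ra", the output buffer holds "Viagra", which a word-based scanner can then match. Each new obfuscation trick demands another preprocessing stage of this kind.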



7. Unicode examples from 2004

The following images were taken from spam messages I received during the year 2004, illustrating Unicode use.



FIGURE: "Unicode" characters allow the characters of major world languages to be placed in documents, such as documents in the HTML format. The spam message shown above, which I received in 2004 March, illustrates a conventional use of Unicode characters -- in this case to represent letters of the Russian (Cyrillic) alphabet.



FIGURE: Spammers have found another use for Unicode characters: displaying characters that look like English letters, but in fact are letters and symbols from other world languages. Thus, English readers, humans, have no trouble reading the text, but automated content scanners will fail to detect the presence of "spam-indicating" words. Of course one solution is to build up a table of how Unicode characters visually relate to English letter and number characters. But, given the large number of Unicode characters visually "compatible" with various English letter and number characters, this effort is likely to be intractable. Combine this with strategic misspellings and random interjection of punctuation, and content filters are doomed to fail. I suppose an isolationist American could block all e-mail containing Unicode, but even plain English characters can be used in creative ways that humans have no trouble reading but present an intractable problem for content scanners. A filter that rejected ungrammatical content, or content with many misspellings, would likely claim a large fraction of "legitimate" domestic e-mail!
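The "table of how Unicode characters visually relate to English letters" idea can be sketched as follows. This is only an illustration under stated assumptions: the names lookalike, LOOKALIKES, and fold_lookalikes are hypothetical, the input is assumed to be already decoded into Unicode code points, and the table lists just a dozen Cyrillic letters. The tiny size of the table is the point: a table covering every visually "compatible" character would be enormous.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per look-alike: a Unicode code point and the ASCII
 * character it resembles. Only a dozen Cyrillic letters are listed;
 * a usable table would need far more entries. */
struct lookalike {
    uint32_t code_point;
    char ascii;
};

static const struct lookalike LOOKALIKES[] = {
    {0x0410, 'A'}, {0x0415, 'E'}, {0x041E, 'O'}, {0x0420, 'P'},
    {0x0421, 'C'}, {0x0430, 'a'}, {0x0435, 'e'}, {0x043E, 'o'},
    {0x0440, 'p'}, {0x0441, 'c'}, {0x0445, 'x'}, {0x0456, 'i'},
};

/* Fold n code points into an ASCII string; out needs n + 1 bytes.
 * ASCII passes through, known look-alikes are replaced, and
 * anything else becomes '?'. */
void fold_lookalikes(const uint32_t *in, size_t n, char *out)
{
    const size_t table_len = sizeof LOOKALIKES / sizeof LOOKALIKES[0];

    for (size_t i = 0; i < n; i++) {
        char c = '?';
        if (in[i] < 0x80) {
            c = (char)in[i];
        } else {
            for (size_t j = 0; j < table_len; j++) {
                if (LOOKALIKES[j].code_point == in[i]) {
                    c = LOOKALIKES[j].ascii;
                    break;
                }
            }
        }
        out[i] = c;
    }
    out[n] = '\0';
}
```

With this table, the code points 'V', U+0456, U+0430, 'g', 'r', U+0430 fold to "Viagra". A spammer only needs one look-alike that is missing from the table to get past the scanner again.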





8. Off-topic text examples from 2004

The following images were taken from spam messages I received during the year 2004, illustrating use of off-topic text.



FIGURE: This spam message includes a paragraph from a formal text (in this case a Travel Warning issued from the United States Department of State on 2004 March 23rd: http://travel.state.gov/israel_warning.html; I discovered this by a Google search for "curfew should remain indoors"). The meaning of the text isn't as important as the fact that it is grammatical, potentially "interesting", and large enough to greatly "outweigh" any spam indicators that might be detected elsewhere in the message.



9. Base-64 encoding of HTML examples from 2004

The following images were taken from spam messages I received during the year 2004, illustrating use of base-64 encoding of HTML.

"Base-64 encoding" is a method of representing sequences of Byte values by a sequence of ASCII characters within the following set of 64 characters:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

Thus, the character 'A' corresponds to an integer value "0" (zero), and the character '/' corresponds to an integer value "63" ("111111" in binary). Groups of three Bytes from an input sequence are regarded as a sequence of 24 bits. Four 6-bit values are extracted and converted to the corresponding characters in the set above. If the input sequence has a total number of Bytes that is not a multiple of three, the input sequence is padded with zero-value Bytes, and the output sequence is padded with '=' characters.

Base-64 encoding is typically used to allow binary data, such as e-mail file attachments (with extensions ZIP, JPG, MP3, DOC, EXE, etc), to be contained in the plain-text body of an ordinary e-mail. Thus, text-based operations can be conducted on mail archives without worrying about encountering non-ASCII characters, or problematic "control characters" (such as "Null"/NUL/0x00/^@, "End of Transmission"/EOT/0x04/^D, etc).

However, spammers have used base-64 encoding as a simple means to obfuscate their HTML content. Thus, very simple text filters, or human readers, cannot easily examine the content of such spam messages. Sure, it is easy to add a base-64 decoding stage to a content-based spam filter, but this is yet another example of the increasing complexity of spam detection.
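The decoding direction of the scheme described above can be sketched in C as follows. This is a minimal illustration written for this explanation and is not the base64totxt.cpp utility offered below; the names B64_ALPHABET and base64_decode are hypothetical. The sketch assumes a NUL-terminated input string, skips any character outside the 64-character set (such as line breaks), and stops at the first '=' padding character.

```c
#include <stddef.h>
#include <string.h>

static const char B64_ALPHABET[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Decode base-64 text from in into out; returns the number of bytes
 * written. out needs room for 3 bytes per 4 input characters.
 * Characters outside the alphabet are skipped, and decoding stops at
 * the first '=' padding character. */
size_t base64_decode(const char *in, unsigned char *out)
{
    unsigned long bits = 0;  /* accumulates 6-bit values */
    int nbits = 0;           /* number of not-yet-emitted bits in 'bits' */
    size_t n = 0;

    for (; *in != '\0' && *in != '='; in++) {
        const char *p = strchr(B64_ALPHABET, *in);
        if (p == NULL)
            continue;  /* e.g. a line break inside the block */
        bits = (bits << 6) | (unsigned long)(p - B64_ALPHABET);
        nbits += 6;
        if (nbits >= 8) {
            nbits -= 8;  /* a full byte is available: emit it */
            out[n++] = (unsigned char)((bits >> nbits) & 0xFF);
        }
    }
    return n;
}
```

For example, the eight characters "PGh0bWw+" decode to the six bytes of the text "<html>", and "TWE=" decodes to the two bytes "Ma". The function does not append a terminating NUL, since decoded data may be binary.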



FIGURE: This is the plain-text appearance of a spam message containing HTML encoded as base-64.

The following C code compiles to a very crude base64-to-text conversion utility. One must manually place a base-64 block of text, by itself, in a text file, and then use this utility to generate text output -- which, on the command line, can be redirected to an output file. I only wrote this code as a learning experience, and to convert samples of base-64 from e-mail messages. I offer it for illustration, and perhaps to inspire you to seek a "real" base-64 decoding utility elsewhere.

Download C Source Code for Base64-to-Text Conversion Utility: base64totxt.cpp [2146 Bytes; CPP file]







FIGURE: This is the plain-text decoded form of the base-64 content in the spam e-mail above. Apart from revealing the web site address promoted by the spam message, the decoding also reveals "random" words designed to "dazzle" Bayesian spam filters (such as the Mozilla browser Bayesian spam filter). It is important to note that this spam-filter countermeasure was placed within a base-64 encoded block -- which means the spammer assumes that base-64 blocks will be decoded and subjected to content analysis. One more VERY IMPORTANT thing to note about this example: The spam message is totally contained in the image specified by the HTML image tag, stored on a remote server. Thus, unless a content-based filter notes the "tabs", "biz", and "pills" parts of the web site address and file paths, this message is totally benign. (A better spammer would not put "tabs" and "pills" in any part of a URL. An explicit IP could eliminate the explicit "biz" domain part -- although a reverse DNS lookup could find the "biz" part.) The point is that it is trivial to eliminate any evidence that the message is spam. Sure, one could add a new spam detection rule that flags e-mail messages that only contain HTML image tags, etc, but the risk of flagging legitimate e-mails in the process is high.
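How "random" or off-topic words can "dazzle" a statistical filter can be illustrated with a small calculation. The sketch below uses one common naive-Bayes style rule for combining per-word spam probabilities; it is an assumption for illustration, not the rule used by any particular filter, and real filters add refinements (such as considering only the most significant words) that change the numbers. The function name combined_spam_score and the per-word probabilities are hypothetical.

```c
#include <stddef.h>

/* Combine per-word spam probabilities p[0..n-1] into one score:
 *     prod(p) / (prod(p) + prod(1 - p))
 * Each p[i] should lie strictly between 0 and 1, otherwise both
 * products can reach zero and the division is undefined. */
double combined_spam_score(const double *p, size_t n)
{
    double prod_spam = 1.0;  /* product of p[i]       */
    double prod_ham = 1.0;   /* product of (1 - p[i]) */

    for (size_t i = 0; i < n; i++) {
        prod_spam *= p[i];
        prod_ham *= 1.0 - p[i];
    }
    return prod_spam / (prod_spam + prod_ham);
}
```

Under these assumptions, a message whose only scored words are two strong spam indicators (0.99 each) scores above 0.99. Appending ten innocuous-looking words scored at 0.2 each pulls the combined score below 0.05, even though the spam indicators are still present.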



10. Example of tracing spam from 2004

This section describes a modest effort to learn more about a spammer.



FIGURE: Despite the "explanation" of why this business is able to offer such low prices on software, this spam message was clearly sent from a software pirate. If indeed one can actually pay money and initiate a file download from this spammer's server, there is no assurance that the file will be the actual application, as opposed to a virus, or a virus-infested version of the desired application. In any case, for the purpose of illustrating the search for information about sources of spam, I chose to look up the owner of the domain (associated with a URL found in the HTML code of this spam message): "buye-soft.biz".



The search begins by going to the "whois" search page of the InterNIC web site:

http://www.internic.net/whois.html


FIGURE: I assume ALL of the information returned by this "whois" search is fake -- except for the trivial details: Domain Name, Domain ID, Domain Status, Registrant ID, Name Server, Created By Registrar, Last Updated by Registrar, and the dates. The only really interesting part is the "Domain Registration Date", which in this case indicates that the domain was created less than one week prior to my receipt of a spam message containing links to a web site in this domain. I have seen this many times -- domains registered within days of corresponding spam messages (containing links to hosts within those domains). As mentioned earlier, the $20 one-year domain registration fee and a $20 web hosting fee make for a total overhead of $40 to start a spam campaign -- and a single customer, or a single scam victim, can easily surpass the break-even point of the business model!
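The break-even reasoning can be written out as simple arithmetic. In the sketch below, the $40 overhead and the 1/10000 response ratio are the figures used in this article; the recipient count and the profit per sale are hypothetical inputs chosen only for illustration, as are the function names. Amounts are in cents to keep the arithmetic in integers.

```c
/* Expected number of customers when one recipient in `one_in`
 * responds (for example, one_in = 10000 for a 1/10000 ratio). */
unsigned long expected_customers(unsigned long recipients,
                                 unsigned long one_in)
{
    return recipients / one_in;
}

/* Smallest number of sales whose combined profit covers the fixed
 * overhead. Both amounts are in cents; profit must be nonzero. */
unsigned long break_even_sales(unsigned long overhead_cents,
                               unsigned long profit_cents_per_sale)
{
    /* Integer division rounded up. */
    return (overhead_cents + profit_cents_per_sale - 1)
           / profit_cents_per_sale;
}
```

For a hypothetical campaign of 1,000,000 recipients, expected_customers(1000000, 10000) gives 100 customers. With the $40 overhead and an assumed profit of $50 per sale, break_even_sales(4000, 5000) gives 1: the first sale already covers the cost of the campaign.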





DEFINITION OF "SPAM"

Spam: An unsolicited message sent to a large number of recipients.



Unfortunately, that simple description has subjective terms.
The subjective terms are considered in this section.

A. "Unsolicited"

Whether or not an e-mail message was "solicited" (i.e., the recipient explicitly or implicitly requested the message in advance) is one of the most-debated issues related to spam.

"Registration" for Online Services

Many online services require "registration" -- a process whereby an entity (e.g., corporation or organization) gathers personal information from a person who desires to use their service. The registration process may demand an e-mail address. The registration process may enforce this demand by sending an e-mail to the prospective user of the service with a piece of information essential to the completion of the registration process (such as a password, or a temporary URL to visit) thus validating the (potentially temporary) association between the e-mail address and the user. If the service provider is an ethical corporation (i.e., one whose controlling power is not beholden to the interests of profit alone, but instead has ethical conduct as part of its mission), then very early on in the registration process there will be a full description of how all pieces of information will be used (and, perhaps more importantly, not used). Unfortunately, e-mail "notifications" and offers from "partners and affiliates" are often part of online service contracts.

The Abuse of the "Opt-in" Concept

Although the practice is fading as spammers simply give up on maintaining the illusion of compliance with laws, for a time (around 2000-2002) it was common to see spam messages ending with a disclaimer that essentially stated the following: "This message is not spam. You are receiving this message because you requested this message from this service, or opted-in to mailings from one of our affiliates." One can either ponder in amazement that Citibank, Bank of America, or Microsoft would be the "affiliates" of pornographers, illiterate Viagra purveyors, wireless spy camera vendors, and software pirates -- or conclude that the disclaimer is a total falsehood. It is not difficult to imagine that a giant corporation might have a slightly less monitored affiliate with slightly lower standards of business ethics. And it is easy to imagine that, through the corporate equivalent of "Six Degrees of Separation", personal data submitted to any giant corporation might eventually, through a chain of "affiliation", make it down to any given business, or outright criminals, on our planet.


ASIDE: Spam Analogy in Other Forms of Indiscriminate Communication

It is interesting to note that billboards on the sides of freeways, buses, and taxi cabs might qualify as a kind of government-approved visual spam. I bet that if a proposition to eliminate all billboards were placed on a state ballot, the overwhelming majority would vote in favor of it. The fact that billboards clutter our visual space testifies to the gap between local government and its constituents. I think it is interesting to consider spam in relation to billboards, which essentially radiate data by photons in all directions without regard for recipients; loudspeakers, which radiate data by sound waves in many directions without regard for recipients; and mass distribution of a single message via postal mail, facsimile machines, telephone, or fliers. Some of these "spam" variations rely on proximity, so the "technology" to avoid such spam is simply to move away from the emitter. But other variations, such as postal mail or a telemarketing phone call, essentially bring the message very close to its targets. Here, the "technology" to avoid being distracted with the task of differentiating between spam and desired messages is limited to "refusing bulk postal mail by request" and joining the "national 'do-not-call' list" (which relies on vigilant consumers, and on laws to act as a deterrent for would-be violators).





B. "Message"

Obviously the term "message" must be used in the broad sense, referring to the "intention" of the data received by each recipient. Otherwise, a second message constructed by permuting the sentences of a first message, for example, would be considered a distinct "message".

Unique Text per Recipient

Almost all spam messages today contain procedurally generated text that is unique per recipient. Some of this procedural text consists of totally random words (to defeat word-frequency filtering, or word-pair Bayesian filtering), some is random but grammatical (to defeat more advanced grammar-based filtering), and some is random text taken from online news and information sources (which is likely to be overwhelmingly approved by any filtering). Also, the real message text, the intended communication, can be procedurally modified to be unique per intended recipient. This can occur at the character level, word level, sentence level, and paragraph level. Misspellings can be introduced, especially by using look-alike characters ('0' vs. 'O'; or Unicode characters) or by transposing adjacent letters within a word ("Viagra" vs. "Vaigra"). Sentences can be permuted. So, defining "message" as a literal sequence of characters (or bytes) will fail to identify "spam".
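The failure of literal matching can be illustrated with a minimal sketch. It assumes Python, uses only the two character-level tricks named above (adjacent-letter transposition and the 'O'/'0' look-alike swap), and treats a SHA-256 digest of the exact bytes as the "literal identity" of a message. The function names and the sample text are hypothetical.

```python
import hashlib


def transpose_adjacent(word: str, i: int) -> str:
    """Swap the letters at positions i and i+1 of a word."""
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def lookalike(text: str) -> str:
    """Replace the letter 'O' with the look-alike digit '0'."""
    return text.replace("O", "0")


def fingerprint(message: str) -> str:
    """Literal identity: a digest over the exact bytes of the message."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


original = "SPECIAL OFFER: Viagra"
variants = [
    original,
    original.replace("Viagra", transpose_adjacent("Viagra", 1)),
    lookalike(original),
]

# Each variant reads the same to a human, yet has its own fingerprint.
fingerprints = {fingerprint(v) for v in variants}
```

All three variants carry the same intention, but a filter that compares literal text (or digests of it) sees three unrelated messages.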







METHODS WHICH FAIL TO SIGNIFICANTLY
REDUCE IMPACT OF SPAM

1. LAWS

The assumptions leading to the creation of laws to indirectly reduce the impact
of spam include:

(a) The existence of a tough law against spamming will serve as a sufficient deterrent for potential spammers;

(b) The person ultimately responsible for the spam messages can be identified.



Reasons why laws cannot stop spam include:

(1) Messages can originate in countries that either do not have spam laws or do not have sufficient resources to enforce spam laws.

(2) Spam may originate from any of the billions of people on our planet with Internet access, and although laws may lead to a sufficient psychological deterrent for the vast majority, it only requires a few people to generate billions of e-mail messages per day.

(3) The connection between businesses and spamming campaigns will be increasingly difficult to make, especially if there are a few cases in which competitors or hackers seek to implicate a company as a spammer by secretly initiating a spamming campaign on that company's "behalf". Such a scenario, among others, would introduce a level of doubt about the simplistic argument that the party that benefits from spam must be responsible for the spam. There is, rightly, a large amount of plausible deniability in this context. One particularly difficult variety of spam campaign to prevent with legal deterrents is one launched spontaneously by a person on his or her own initiative on behalf of a political cause or some other large-scale phenomenon. A person with a modest amount of programming ability can compromise a mail system and use it to spread a message that was not endorsed by the organization that might ultimately benefit. The individual spammer promotes a cause, which, if advanced, will somehow end up benefiting the spammer. It's a kind of "advocacy laundering", relying on the large number of advocates and potential beneficiaries. Examples include: promoting specific stocks; promoting political candidates or legislation; promoting hate; etc.

(4) Although this is more of an observation than an explanation, it is interesting to consider that many spam messages are far more illegal than simply wasting people's time with unwanted advertisements. Spam messages can be used to deploy viruses (destructive, spying, use of computing resources) or play a role in the process of "identity theft" (such as directing a person to a fake clone of a banking or service web site and requesting confidential information, often, ironically, with the claim of "increasing security" through "verification"). The people behind such spam efforts are way beyond considering the consequences of spam-specific laws.




2. IP/E-MAIL ADDRESS BLACKLISTS

The following describes the concept of an IP/e-mail address blacklist:

Each time a complaint about spam originating from a specific IP address (or an entire subnet or domain) or from a specific e-mail sender address is reported, the source is added to a list (blacklist) that is used as a basis for deciding whether or not to allow future e-mail messages to pass.



Reasons why a blacklist method cannot stop spam and would lead to many problems
and ethical issues include:

(1) Spammers can easily change the origin of spam. Domain registration costs as little as $20, and renting the use of a web server in a data center can be very inexpensive and without significant commitment. Do a "whois" lookup on the domain name associated with a spam message (if any such links exist in the body), and you may discover that the domain was established just a few days prior to your receipt of the spam message -- a delay just long enough for the domain registration to propagate to DNS servers world-wide. Even if the blacklisting occurs on the same day as the spam campaign, the spam has already reached all intended recipients.

(2) Blacklisting introduces the risk of a result that is incomprehensibly bad: the diminishing of free communication. China blocks many web sites (including, apparently, my own site). Some corporate web sites refuse traffic referred by other sites. Search engines, which many people rely on as an "unbiased representation of web content", impose their own content filtering and prioritization (sometimes to avoid lawsuits, or to serve the paid interests of business partners or investors). Blacklisting is opposed to democracy, where people depend on being able to gather or receive information without any bias in the mechanism of gathering or receiving. When major "news" sources fail to deliver information without bias, the Internet may be our only recourse in the quest for the facts. Any mechanism that interferes with such a quest is opposed to democracy. Even "gate-keepers" with good intentions can lead to a condition in which freedom of expression and reception of ideas is greatly diminished without a corresponding degree of benefit from the act of filtering information. The best type of decision-making involves having as much reliable information as possible, and many important decisions depend on learning about conditions that are frequently changing. Blacklisting interferes with such decision-making.

(3) Malicious people can provoke the blacklisting mechanism into blocking domains, subnets, or e-mail addresses. By studying the blacklisting mechanism, one can devise a method to use it to one's advantage. Hackers can cause the blacklisting of anything capable of being blacklisted, perhaps simply by generating "spam" on "behalf" of each intended blacklisting target.

(4) Automated blacklisting cannot differentiate between non-critical links and critical links in spam HTML code. Thus, a spammer can populate e-mail messages with links to reputable web sites not actually affiliated with the spam campaign (such as putting real links to PayPal, eBay, Earthlink, Citibank, etc., to make a convincing clone of an actual corporate announcement). Automated blacklisting may blacklist legitimate web sites or domains, and miss the highly obfuscated links to the site associated with the spam effort.

(5) Blacklisting will fail to achieve any beneficial result, and will cause communication problems for many random people, in the case of spam generated by distributed mail servers (as with Trojan viruses that install e-mail servers on all infected systems). Not only can the e-mail come from any of thousands of infected machines, but it can even offer working links (URLs) to tiny web servers on any of the thousands of infected systems. The hacker is free to link into the peer-to-peer virus system and gradually and passively accumulate, for example, "identity theft" information. Blacklisting would likely block random individuals, and possibly many random domains. The virus need not be a stand-alone, clandestine application, but can in fact be a hacked form of a standard peer-to-peer application. Given the capabilities of popular peer-to-peer applications and their lack of attention to security (let alone the applications people share and execute), conveying a patch or script to transform a file-sharing application into a platform for spam delivery and for web page serving (i.e., a totally decentralized "web site" with thousands of radically different URLs with the same content) is trivial.




3. GENERIC HUMAN-SKILL CHALLENGES

The following describes the concept of a generic human-skill challenge:

If a resource, such as a communication channel, is only intended for use by humans, and not automated systems, then a challenge can be designed to determine if a request for access to the resource comes from a human. A generic human-skill challenge is a task that the overwhelming majority of adult humans can easily accomplish without any prior knowledge about the challenge, while, at the same time, any practical automated system would consume resources (time, space, cash, etc) far beyond the value of the resource protected by the challenge.



The free "Yahoo! Mail" service uses a generic human-skill challenge during
the registration process to prevent automated systems (created by spammers,
for example) from registering for thousands of e-mail accounts.
Aspects of "Yahoo! Mail", including the generic human-skill challenge, are
shown in the following figure:


One VERY IMPORTANT thing to consider about "Yahoo! Mail", as described by the quotes shown above regarding service features, is that e-mail messages AND attachments are AUTOMATICALLY ANALYZED. While this analysis may have been devised by well-meaning individuals trying to improve the experience of users of the "Yahoo! Mail" service, it is critical to understand that this is detailed analysis of one's communication. Although I believe the privacy of e-mail itself is promised up to a specific point (like court subpoenas), I do not see any such promise of privacy concerning the use of the results of the analyses (which may actually contain rough semantics (meanings) of content). Is a transliteration of an e-mail, using different figures of speech and phrases, and synonyms where possible, assured the same level of privacy? In what sense is even an "anonymous, automated" analysis of e-mail messages and attachments "private"?

Generic human skill challenges can be compromised by automating a system of getting actual humans to unwittingly solve challenges on behalf of the automation. Yet another compromise is to devise algorithms that can actually perform the task that supposedly requires "exclusively-human" skills.




4. COMMUNICATION-RATE REGULATION BY
REQUIRING RESULTS OF DIFFICULT
COMPUTATIONAL TASKS

The following describes the concept of communication rate regulation by
requiring results of difficult computational tasks:

Communication rates can theoretically be regulated by requiring a message sender to solve a random mathematical problem, of known difficulty, posed by the receiving party. If the solution offered by the sender is not correct, according to the receiver, then the message is rejected. The receiver can multiply two probable-prime numbers together, yielding a product with a specific number of digits, and request that the message sender do the work of factoring the product back into its prime factors. The receiver can easily check whether or not the factors submitted by the message sender yield the correct number when multiplied together. Although a factoring task will take a different amount of time for different processor types and processor speeds, the order of magnitude of the time required might be the same for all contemporary mail servers. If the message receiver thinks the problem was solved too quickly, it can pose several more problems. The computational work keeps the sender's processor very busy, which, in principle, prevents it from sending lots of messages to the same or multiple communication relays.
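The challenge-and-verify exchange can be sketched as follows. This is a toy illustration in Python, not a deployable scheme: the primes are tiny (five digits) and found by trial division rather than by a probable-prime test, and all function names and size limits are assumptions made for the example. The point is only the asymmetry between the sender's work and the receiver's check.

```python
import random


def is_prime(n: int) -> bool:
    """Deterministic trial division; adequate for the small toy sizes used here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def random_prime(lo: int, hi: int, rng: random.Random) -> int:
    """Draw random candidates in [lo, hi) until one is prime."""
    while True:
        candidate = rng.randrange(lo, hi)
        if is_prime(candidate):
            return candidate


def make_challenge(rng: random.Random, lo: int = 10_000, hi: int = 100_000) -> int:
    """Receiver: multiply two random primes and pose the product."""
    return random_prime(lo, hi, rng) * random_prime(lo, hi, rng)


def solve_challenge(n: int) -> tuple[int, int]:
    """Sender: do the expensive part, factoring by trial division."""
    if n % 2 == 0:
        return 2, n // 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 2
    raise ValueError("n has no non-trivial factors")


def verify(n: int, factors: tuple[int, int]) -> bool:
    """Receiver: the cheap part, a single multiplication."""
    a, b = factors
    return a > 1 and b > 1 and a * b == n
```

The receiver would call `make_challenge`, the sender would answer with `solve_challenge`, and the receiver would accept the message only if `verify` returns true. Raising `lo` and `hi` raises the sender's cost while the receiver's check stays a single multiplication.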



Reasons why communication-rate regulation by requiring results of difficult
computational tasks can fail to control spam include:

(1) Much like distributed computing projects such as "SETI@Home" or "Folding@Home", spammers can use Trojan viruses or hacked versions of popular Peer-to-Peer (P2P) applications (BitTorrent, Morpheus, KaZaA, DirectConnect, Gnutella, Napster, eDonkey) to "farm out" ("out-source") the computational tasks requested by the pesky communication relay. In fact, the P2P applications can do the work of mail transmission on the spammer's behalf, spreading the computational work.

(2) Unless the communication-rate regulating tasks are required between mail relay servers, in addition to being required between sender client and a mail server, then all it takes is for some mail relay server to be compromised, either by hackers or just by unscrupulous server managers, for spam to enter the system with full force.

(3) One wonders if "affiliates" of giant corporations will face the same computational tasks when hoping to fill your mailbox with spam! After all, one might be forced to allow all "partners and affiliates" of a mail service provider to notify one of special offers and products. And there is no end to the possible "affiliate" chain...




5. CONTENT ANALYSIS

The following describes the concept of content analysis:

Content analysis involves automated scanning of e-mail messages to differentiate between messages serving the agenda of spammers and messages sent from family, friends, colleagues, coworkers, subscription mailing lists, and strangers who initiate contact. This differentiation must be based on the content of the message itself. Considering other factors, such as the sender address, or blacklisted IPs embedded within the message, is not properly part of plain "content analysis".



Reasons why content analysis is undesirable:

It is difficult to foresee all of the consequences of loss of privacy in one's life. Loss of privacy regarding medical conditions (AIDS, STDs, cancer, drug addiction, alcoholism, etc) can result in discrimination of various sorts. Loss of privacy regarding ideas and beliefs can result in embarrassment or dismissal from jobs. Loss of privacy regarding inventions, creative ideas, business plans, agendas, competitive strategies, etc, obviously can lead to profound financial and intellectual losses. It is interesting to imagine a world totally without privacy. I have considered how I would adapt. Many institutions and social conventions that relied on privacy would be gone. I'm inclined to think that this new world would be better, as hypocrisy and politics would be shattered, and people would operate from truth rather than image. No illusion could be sustained. No crime would go unsolved. No hypocritical positions on drugs, pornography, religion, etc, would go unnoticed. So, I'm a little torn on this issue. If society could flip a switch and enter "total information mode", then maybe I'd feel more comfortable. But when there is an imbalance, with large corporations or government agencies enjoying the benefits of "total information mode", while consumers and citizens do not have such a benefit, then I am concerned. Google's new "G-Mail" service (run by "G-Men"?!) is a paranoiac's nightmare come to life: an e-mail system that analyzes each piece of mail not simply for spam detection and virus detection, but for targeted advertising! Want to embarrass a friend using G-Mail? Send an e-mail mentioning only "farm sex" or "kiddie porn", and who knows what "targeted advertising" he or she will get. Perhaps there will be a knock on the door from the FBI. 
Despite promises that content analysis will be isolated from any record-keeping, and performed totally anonymously by computer algorithms, the fact is that one's mail is being analyzed, and even requests for banner ads represent an information leak -- connecting your interest with your IP address (since your browser fetches the banner ad from whatever source). With a modest amount of extra data mining, an "anonymously targeted" banner ad betrays your identity.



Reasons why content analysis can fail to control spam include:

(1) Ultimately, only a message recipient can decide, based on content alone, whether or not a message is desired. Clearly a "one-filter-fits-all" idea won't work, so the alternative is "training" a filter, or directly specifying message content that is not desired. Even then, rules and exceptions will not handle the next wave of spammer messages. Every content-based filtering method thus far has been thwarted. All a spammer has to do is study a spam filter to devise messages that will pass through. Perhaps this will eventually require the spammer to use human-like AI to generate human-like mail that essentially conveys the spammer's "message" but without triggering any content-based filters. Randomly injected HTML tags thwarted simple scanning and dictionary lookup. Salting the message with random words thwarted Bayesian filtering. Including boring texts from random, but serious, sources, made messages overwhelmingly grammatical and statistically close to potentially "interesting" message content, thwarting advanced content-based filtering. Base-64 encoding added another layer to the content scanning work. Unicode, visual character look-alikes, randomly injected (but non-obtrusive) punctuation within words, strategic misspellings and letter transpositions that are transparent to human readers, etc, all result in the actual MEANING of the content being totally UNDETECTABLE! Finally, the use of images served from a web site, in conjunction with invisible, but totally benign, content, is totally immune to the desired kind of content analysis. (2) As mentioned with other spam filtering methods in previous sections, the problem of mail services not performing filtering on the promotions and offers of their "partners and affiliates" may just change the nature of the spam from world-wide to high-paying corporate sponsors. This puts the burden of content-based filtering on the client.
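Two of the evasions listed in item (1) can be demonstrated against a deliberately simple filter. The sketch below assumes Python and a hypothetical one-word block list; real filters are more elaborate, but each evasion still forces the filter to add another decoding or normalization layer before it can see the text a human would read.

```python
import base64

BLOCKED_WORDS = {"viagra"}  # hypothetical block list


def naive_filter(body: str) -> bool:
    """Return True if any whitespace-separated word is on the block list."""
    words = body.lower().split()
    return any(word.strip(".,!?:;") in BLOCKED_WORDS for word in words)


plain = "Buy Viagra now"

# Non-obtrusive punctuation injected within the word.
punctuated = "Buy V.i.a.g.r.a now"

# The same text, Base-64 encoded as a mail client would decode it for display.
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")

caught = [naive_filter(text) for text in (plain, punctuated, encoded)]
```

Only the plain form is caught. The punctuated form remains readable to a person, and the encoded form is restored by the recipient's mail client.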






OTHER UNDESIRABLE SPAM CONTROL METHODS

(1) Internet-wide packet tracking (does not defeat Trojan P2P type outgoing
    mail servers; invades privacy)

(2) "Postage" per e-mail (does not defeat Outlook Trojans and other P2P
    mail servers)

(3) Relying on corporations and building up "user reputation" (like eBay
    system, or Slashdot karma); invades privacy, corporate interests may
    create artificial barriers to moving data (e.g., BREW), and there is
    no guarantee that corporations won't eventually "sell-out" the promise
    of no spam -- or it will be argued that partner "notifications" won't
    be spam (according to the end-user terms).




SPAM CONTROL PROPOSAL

This section contains a proposal for SOFTWARE and SOCIAL PRACTICES that have
the potential of greatly reducing the nuisance of spam from a person's life.


GENERAL INFORMATION

Things required by this proposal:

(1) A person who wishes to greatly reduce spam must install software on each computer with an e-mail client application (such as Microsoft Outlook). (2) A person who wishes to greatly reduce spam, when sharing his or her e-mail address, must also go through the trouble of sharing a code number. (3) Mailing list services must make a slight modification to their databases and mailing scripts to store and use codes in addition to e-mail addresses.


Things that are NOT required by this proposal:

(1) Changes to e-mail servers, e-mail protocols, e-mail content standards, or Internet infrastructure, are not required. (2) Existing spam countermeasures (content-filtering, IP blacklisting, anti-spam laws, etc) will not be necessary. (Such countermeasures are futile and dangerous anyhow.) (3) It is possible that changes to existing e-mail clients will not be required.


Things that will NOT be directly helped by this proposal:

(1) Internet bandwidth consumed by the futile efforts of spammers trying to make it through to people. (Once the futility becomes apparent worldwide, the spamming model may naturally be a very unattractive waste of time.) (2) E-mail "inbox" clogging while the spammer profession lingers on, before the futility of spamming has a chance to sink in worldwide. (3) People with e-mail clients and services provided by giant corporations may not experience the diminished spam until the giant corporations have a chance to update software.


Other qualities of this proposal:

(1) Totally open technology; not "security through obscurity". (2) Non-commercial, public-domain method, can be implemented by anyone without consideration. (3) Totally smooth transition from current e-mail clients, servers, mailing list services, etc. (4) Privacy preserved (no content analysis), and possibly even improved (as proposed software becomes more widespread).


CORE CONCEPT

The following paragraphs describe the core concept of the method. Certain details will be discussed in the "Use Cases" section:

Messages received by an e-mail client will be sorted by codes contained in the message subject fields or within the message bodies. Spam messages are extremely unlikely to contain the proper codes, and are thus diverted to an anonymous-sender category. Unlike an e-mail address alone, which is a single, unmoving target for spammers, the additional codes are generated by formulae, and are tiny, constantly-moving targets in a huge expanse of possible target locations. Furthermore, any breach of trust can instantly be traced to specific unscrupulous people, and immediately and conveniently patched. The concept can be likened to "spread-spectrum" communication, or, much more loosely, "port knocking".


CORE IMPLEMENTATION

The following paragraphs describe the core implementation of the method. Three encrypted files are stored on an e-mail client machine:

(1) PRIMARY FORMULA TABLE: Encrypted table with entries in the form:
    ( SHA hash of recipient e-mail address, primary formula )

(2) SECONDARY FORMULA TABLE: Encrypted table with entries in the form:
    ( SHA hash of recipient e-mail address, secondary formula )

(3) CURRENT RECEPTOR TABLE: Encrypted table with entries in the form:
    ( SHA hash of acceptable formula value, category, expiration policy )


Basic operation:

Some time in advance of communication between a sender and receiver, the sender acquires a formula from the receiver. The formula is generated by the receiver and given to the sender by some "secure" mechanism (which can be as casual as a face-to-face conversation, phone call, postal mail, facsimile, or even conventional e-mail or web page). The only point of keeping the formula secure is to maintain the exclusivity of the specific communication channel between sender and receiver. Others cannot use the formula to intercept messages, but they could use it to get their own messages to the receiver through the same moving portal. The sender uses the formula to compute a code, and this code is placed either in the subject line or in the body of the message, and the message is sent by conventional e-mail to the receiver's well-known e-mail address.

The receiver launches software to sort incoming mail. When the software is launched, it requests a password. This password decrypts the current receptor table. Incoming messages are scanned for the special codes (which have easily-recognized characteristics). If a message contains a code, the code is hashed using the Secure Hash Algorithm (SHA). The hashed code is compared with all acceptable hash codes in the decrypted copy of the current receptor table residing only in memory. If there is a match, the e-mail is accepted and placed into the proper mail folder (Family, Friends, Business, Interest Groups, etc.). If any entries in the current receptor table are set to expire, then the primary formula table is decrypted in memory and all formulae are evaluated, replacing all entries in the current receptor table. This implements the "moving" aspect of the "moving target" nature of the method. (Senders evaluate the appropriate formula upon each transmission.)
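The receiver-side sorting step can be sketched in Python. The article does not specify what a formula is, so this sketch makes several loud assumptions: a formula is modeled as a shared secret combined with a period counter through HMAC-SHA-256, codes are seven uppercase letters, the "easily-recognized" marker is a hypothetical `[#ABCDEFG]` token, and SHA-256 stands in for "SHA". Encryption of the tables and expiration policies are omitted.

```python
import hashlib
import hmac
import re
import string

# Hypothetical marker syntax for a code inside a subject line or body.
CODE_PATTERN = re.compile(r"\[#([A-Z]{7})\]")


def evaluate_formula(secret: bytes, period: int) -> str:
    """Assumed formula: derive a seven-letter code from a secret and a period."""
    digest = hmac.new(secret, str(period).encode("ascii"), hashlib.sha256).digest()
    # Map seven digest bytes to letters (slightly biased; fine for a sketch).
    return "".join(string.ascii_uppercase[b % 26] for b in digest[:7])


def sha_hex(code: str) -> str:
    """Hash a code so the receptor table never stores codes in the clear."""
    return hashlib.sha256(code.encode("ascii")).hexdigest()


def build_receptor_table(formulas: list[tuple[bytes, str]], period: int) -> dict[str, str]:
    """Evaluate every (secret, category) formula and index categories by code hash."""
    return {sha_hex(evaluate_formula(secret, period)): category
            for secret, category in formulas}


def sort_message(subject: str, body: str, receptors: dict[str, str]) -> str:
    """Return the folder for a message, or 'Anonymous' if no code matches."""
    for match in CODE_PATTERN.finditer(subject + "\n" + body):
        category = receptors.get(sha_hex(match.group(1)))
        if category is not None:
            return category
    return "Anonymous"
```

A sender holding the same secret would call `evaluate_formula` with the current period and place the result in the message. Rebuilding the receptor table with a new period is what makes the target "move".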


USE CASES

This section describes examples of using the proposed spam control method. Important qualities include: simplicity, convenience, reliability, and minimal transition effort.

1. Personal, Regular Contact
(Family, Friend, Colleague)

Sub-cases:

(1) Potential sender cannot or will not use the special software used to implement this method: The receiver (with the special software) can generate and give out a fixed code (like a telephone number) for use by the sender alone. The sender simply puts this code somewhere in each e-mail. If the code should leak beyond the friend or family, the code can be manually "retired" by the receiver, and a new code can be given to the sender.

(2) Potential sender is also using the special software: The receiver can give the potential sender a formula (which is essentially a code used to generate codes), and the potential sender enters the receiver's e-mail address and the formula into the primary formula table, and, possibly, a second formula (and receiver e-mail address) into the secondary formula table. When the sender initiates the sending of an e-mail message to the receiver, the special software evaluates the primary formula specific to the receiver and places the appropriate code into the outgoing message.


2. Mailing List Service

Sub-cases:

(1) Mailing list server cannot or will not use the special software used to implement this method: There is nothing to be done in this case. The mail arriving from the mailing list server can only be classified according to its claimed sender, or subject, or content.

(2) Mailing list server is also using the special software: When subscribing to a mailing list service, the receiver can give the mailing list service a formula (which is essentially a code used to generate codes), along with the receiver's well-known e-mail address. When the mailing list server initiates the sending of an e-mail message to the receiver, the special software evaluates the primary formula specific to the receiver and places the appropriate code into the outgoing message.

*** Ideally, the mailing list server would expect codes from subscribers, too, thus preventing non-subscribers from spamming the list. ***


3. Public Displays (Web Pages, Printed Matter,
etc)

This is really not much different from case "1. Personal, Regular Contact (Family, Friend, Colleague)". For web pages, a server-side script (PHP, ASP, etc.) can evaluate a formula to display codes on the web content -- preferably as images. Anonymous human visitors can read the codes and use them with any e-mail client to contact the receiver associated with the web site. The formula can be associated with a single page of the site, or with the entire site. The formula can be re-evaluated on any desired time-scale, thus frustrating anyone who takes the time to manually enter the current code into a spammer database. For printed content, a code can be printed along with the well-known e-mail address. This code is specific to the printed campaign, much like a URL for a movie promotion web site is only ever going to be used for that specific movie promotion campaign. The code can be retired after the promotion is over.


BREACHES OF SECURITY

This section describes various scenarios involving breaches of security.

(1) Trial-and-error attack on a specific user's e-mail address: Apart from clogging the user's e-mail "inbox", the only possible reward for the unlikely event of making it through the method's barrier is that a single e-mail gets through. The attacker does not learn anything from the attack, and only a similar effort would have a chance of getting another e-mail through. The probability of success is inversely proportional to the number of possible codes, which grows with the length of the codes generated by the formulae. For seven-letter codes, it would be 1/(8031810176/2), about a 1 in 4 billion chance, roughly multiplied by the number of active formulae, per attempt. Most e-mail "inboxes" would be clogged by 4 billion e-mail messages, and such an attack is clearly an entirely different problem from the spam problem.

(2) Finding a copy of a plain-text e-mail at any location accessible through the Internet (even unauthorized or unanticipated access via unprotected directories or accidental wireless exposure of private drives): This compromises a code. If the code is associated with a formula-based channel (all parties have the special software), nothing needs to be done; the code will expire, and it may in fact be a one-time-use code. Worst-case outcome: A stranger sends e-mail that is not filtered out by the receiver.

(3) Compromise of a system with a "mailing list" or "address book" (in both cases this simply means the three files listed in the CORE IMPLEMENTATION section): Sub-cases:

(A) The files are corrupted or deleted: The user of the software will know this has occurred. One option is to restore the files from a recent backup.

(B) A hacker logs keystrokes on the machine, acquires the key for decrypting the primary formula table and current receptor table, and then copies those files from the compromised machine: This is a serious compromise. If this has been suspected or detected, then the user can rid the system of the intrusion, use a different key to decrypt the secondary formula table, and command each recipient to replace the entries for the compromised sender and switch to the secondary entries. New formulae can be exchanged by methods that depend on how secure the parties feel circumstances are at that point.

(4) Compromise of a system relaying e-mail at any point in the Internet: Sub-cases:

(A) Message content is encrypted: Compromise has no impact.

(B) Message content is not encrypted; codes can be extracted: In this case a code has been compromised, in addition to message content. If the sender has the special software, the code may only work once, and the snooping party will not be able to get through the receiver's shield once the code expires (which can be upon receipt of a message).

(5) Compromise of any archive or cache of e-mail messages: Same as case (4).

(6) Compromise of a desktop computer via an executable (altered pirated software, MS-Outlook script, prior e-mail attachment virus, IE exploits/ActiveX, Windows/Linux exploits, etc.): This is a serious compromise. The hacker has free rein over one's computer, including the ability to corrupt or delete the encrypted files associated with the special software. The hacker can also modify or replace the special software. The hacker will only gain access to the decrypted content of the primary and secondary formula files if the regular user of the compromised system opens those files following the compromise event. Even if the hacker has access to the decrypted contents of those files, the hacker will have to hash all e-mail contacts to determine the correlation between contacts and formulae. Finally, if the hacker really accomplishes all of this, the only reward is the ability to avoid the e-mail filtering of one's contacts. The kind of compromise described here is far worse than the problem of receiving unwanted spam!

(7) Unscrupulous business shares a voluntarily offered formula with "partners and affiliates": The formula can be instantly removed from the receiver's table at the first sign of abuse, and one can report the violation of trust to the business.
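
The arithmetic behind case (1) can be checked with a short sketch. It assumes that the seven-letter codes are drawn from the 26 lowercase letters, which is consistent with the figure 8031810176 (26 to the 7th power), and that each active formula contributes one currently valid code. The function names are hypothetical and are not part of the proposed software. The divisor is left as a parameter so that both the full code space and the halved figure quoted in case (1) can be tried.

```python
# Sketch of the guessing odds in case (1).
# Assumptions (not confirmed by the article): codes use the 26 lowercase
# letters, and each active formula yields exactly one currently valid code.

ALPHABET_SIZE = 26
CODE_LENGTH = 7


def code_space(alphabet_size: int = ALPHABET_SIZE, length: int = CODE_LENGTH) -> int:
    """Number of distinct codes of the given length."""
    return alphabet_size ** length


def per_attempt_success(active_formulae: int, divisor: int) -> float:
    """Approximate chance that one random guess is accepted.

    `divisor` is the figure the odds are measured against: either the
    full code space or the halved figure used in the text.
    """
    return min(1.0, active_formulae / divisor)


if __name__ == "__main__":
    space = code_space()
    halved = space // 2  # the text quotes 1/(8031810176/2)
    print("code space:", space)
    print("halved figure from the text:", halved)
    for n in (1, 10, 100):
        print(n, "active formulae:", per_attempt_success(n, halved))
```

Even with a hundred active formulae, the per-attempt odds under these assumptions stay far below one in ten million, which is why case (1) treats a brute-force attempt as a flooding problem and not as a spam problem.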


LONG-TERM IDEAS

Automatic encryption and decryption of message content at sender and receiver prevents all intermediate routers, Mail Transfer Agents (MTAs), subnets, etc., from doing content analysis or otherwise spying on messages. A plug-in implementing RSA Public-Key cryptography for various mail clients can be found at (http://www.rsa.org). Ideally, the special software proposed in this article would have this capability. If all e-mail contains the codes for the method described in the proposal, and all e-mail is encrypted, then codes will remain useful for longer in keeping a channel reserved for certain parties to communicate, and content will remain private.
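
As a rough illustration of the public-key idea mentioned above, the following toy sketch shows textbook RSA with deliberately tiny primes. It is not the plug-in referred to in the text and is not secure: the primes, the exponent, the one-byte-per-block encoding, and all function names are assumptions chosen only to keep the example short. A real mail client would rely on a vetted cryptography library with large keys and a proper padding scheme.

```python
# Toy textbook RSA, for illustration only (NOT secure).
# Assumptions: tiny fixed primes, no padding, one byte per block.
# Requires Python 3.8+ for pow(e, -1, phi).


def make_toy_keypair(p=61, q=53, e=17):
    """Return ((n, e), (n, d)) for the given small primes."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse of e modulo phi
    return (n, e), (n, d)


def encrypt(message, public_key):
    """Encrypt each byte separately with the receiver's public key."""
    n, e = public_key
    return [pow(byte, e, n) for byte in message]


def decrypt(blocks, private_key):
    """Recover the original bytes with the receiver's private key."""
    n, d = private_key
    return bytes(pow(block, d, n) for block in blocks)


if __name__ == "__main__":
    public_key, private_key = make_toy_keypair()
    ciphertext = encrypt(b"meet at noon", public_key)
    # A relay that sees only `ciphertext` cannot read the content
    # without the private key (given realistic key sizes).
    print(ciphertext)
    print(decrypt(ciphertext, private_key))
```

The point of the sketch is only the division of roles: the sender needs nothing but the receiver's public key, while systems relaying the message see ciphertext and hold no key that would let them analyze the content.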

CONCLUSION REGARDING PROPOSED METHOD

I did not describe the details of how the proposed system would work, but I hope the proposal aspect of this article leads to more thinking about solutions to spam -- especially about solutions that avoid invasion of privacy by any form of content analysis or packet tracking, or cooperation with specific corporations, or censorship.




CONCLUSION

Spam messages may be annoying, and may consume resources, but I strongly 
disagree with laws punishing spam.  I also strongly disagree with any 
efforts to filter messages flowing through mail servers, or the practice
of blacklisting hosts or domains.  None of these approaches will be 
effective in the long term.  Meanwhile, my proposed solution makes e-mail
as private as an individual desires, and does so without losing a single
message to an over-zealous filter or to corporate or government censors.




POSTSCRIPT -- RESPONSE TO SLASHDOT COMMENTS

I wrote this article, and submitted it to Slashdot, in an effort to increase
awareness of the technology of spam and the perils of various previously-
proposed solutions.

I acknowledge shortcomings with my own proposal, but I hope my article has
contributed something to the ongoing conversations about controlling spam.

After reading most of the Slashdot comments regarding my article, and 
talking with a smart friend, I have been led to the following conclusions:

(1) The proposal in my article does not SOLVE a critical problem: "How does an automated system differentiate between acceptable attempts of strangers to make contact versus unacceptable attempts of strangers to make contact?" I did not have the clarity to think of the spam problem in terms of this simple security puzzle when I wrote this article -- but it's exciting now to think about this specific issue.

(2) The proposal in my article does not SOLVE another critical problem: Consumption of resources. Specifically:

(a) Consumption of bandwidth all along the route from spammer to destination mailbox;
(b) Consumption of disk space on intermediate and final destination mail servers;
(c) Consumption of precious bandwidth between ISP and person downloading e-mail (especially when using dial-up connections).


I am considering these issues, along with the comments made on Slashdot. The following Slashdot comments have supported or refuted various ideas put forward in the article, and I am meditating on them:

Your post advocates a

( ) technical (*) legislative ( ) market-based ( ) vigilante

approach to fighting spam. Your idea will not work. Here is why it won't work. (One or more of the following may apply to your particular idea, and it may have other flaws which used to vary from state to state before a bad federal law was passed.)

( ) Spammers can easily use it to harvest email addresses
( ) Mailing lists and other legitimate email uses would be affected
( ) No one will be able to find the guy or collect the money
( ) It is defenseless against brute force attacks
( ) It will stop spam for two weeks and then we'll be stuck with it
(*) Users of email will not put up with it
( ) Microsoft will not put up with it
( ) The police will not put up with it
( ) Requires too much cooperation from spammers
( ) Requires immediate total cooperation from everybody at once
( ) Many email users cannot afford to lose business or alienate potential employers
( ) Spammers don't care about invalid addresses in their lists
( ) Anyone could anonymously destroy anyone else's career or business

Specifically, your plan fails to account for

( ) Laws expressly prohibiting it
( ) Lack of centrally controlling authority for email
( ) Open relays in foreign countries
( ) Ease of searching tiny alphanumeric address space of all email addresses
( ) Asshats
(*) Jurisdictional problems
( ) Unpopularity of weird new taxes
( ) Public reluctance to accept weird new forms of money
( ) Huge existing software investment in SMTP
( ) Susceptibility of protocols other than SMTP to attack
( ) Willingness of users to install OS patches received by email
( ) Armies of worm riddled broadband-connected Windows boxes
( ) Eternal arms race involved in all filtering approaches
( ) Extreme profitability of spam
(*) Joe jobs and/or identity theft
( ) Technically illiterate politicians
( ) Extreme stupidity on the part of people who do business with spammers
( ) Dishonesty on the part of spammers themselves
( ) Bandwidth costs that are unaffected by client filtering
( ) Outlook

and the following philosophical objections may also apply:

( ) Ideas similar to yours are easy to come up with, yet none have ever been shown practical
( ) Any scheme based on opt-out is unacceptable
( ) SMTP headers should not be the subject of legislation
( ) Blacklists suck
( ) Whitelists suck
( ) We should be able to talk about Viagra without being censored
( ) Countermeasures should not involve wire fraud or credit card fraud
( ) Countermeasures should not involve sabotage of public networks
( ) Countermeasures must work if phased in gradually
( ) Sending email should be free
( ) Why should we have to trust you and your servers?
( ) Incompatiblity with open source or open source licenses
( ) Feel-good measures do nothing to solve the problem
( ) Temporary/one-time email addresses are cumbersome
( ) I don't want the government reading my email
(*) Killing them that way is not slow and painful enough

Furthermore, this is what I think about you:

(*) Sorry dude, but I don't think it would work.
( ) This is a stupid idea, and you're a stupid person for suggesting it.
( ) Nice try, assh0le! I'm going to find out where you live and burn your house down!


I don't know what spam data you used, but i've noticed quite a few spams getting through my bayesian filter lately... they all have more random words in sentences at the bottom than the real message at top. They do it like 'hank urged me and I to send you this flower and important notice' Bad grammar but i'm sure it's meant to look like a 'real' sentence since the computer can't 'read' like a person. It's kinda like an adlib game... they make a list of several hundred sentences with verbs and/or nouns missing then use word lists to fill them in.


we're using GFI MailEssentials at work, which uses bayesian filtering. all the emails that have a random string of words arent getting through the filter. the ones that are usually contain, e.g. 1 normal sentence, and one link. the problem is, we arent getting the same one over and over, it only comes in once, so it's hard to train the filter to block it. but hey, what do i know?


"I use bogofilter and have a corpus of 20k spam messages, I always rescore misfiltered spam, and I still get messages that slip through the filter. Almost all are messages with a ton of random garbage appended to the message, and one spammer was actually putting whole passages from some book about Abe Lincoln in the messages. Jamming the message with non-spam words works too well around here."


It's dangerously bad. If email messages accurately identified where they came from, and if spammers didn't maliciously forge addresses of people they want to harass, and if spammers didn't usually abuse free email systems and free web pages or forge purely bogus sender addresses (usually also at free email systems), then that would be a fine idea. Many spammers also frequently put other people's valid URLs in their mail to fake legitimacy, e.g. URLs from E-Bay's news site or the Better Business Bureau or various anti-virus companies, in addition to having their own URL for the suckers to click.


Strange, when I create new e-mail accounts, like my work account, I generally have a couple messages of spam waiting for me already.


That worked for me until I emailed a customer feedback comment to a somewhat large corporation which makes a product I really like. I also got a satisfactory reply from their customer representative. A few months later, that *expletive* customer representative forwards one of those stupid urban myth chain-letters (about some missing kid/fake amber alert), using that company's email address book, which included my email address! Then the spam deluge started. :(


I used to think that would work, then I went to college and had spam in my inbox before I had even first checked it. I hadn't given out the address to anyone, not even my close friends.


Is it just me, or has recent spam flavor included random sentences (not just random word lists) that are meant to sound like a plausible person is on the other end? Then, embedding some link to spam inside, in an attempt to get the S/N filters to let it pass?


This is what I got: Jen, searching for a site to purchase medication? Character is that which reveals moral purpose, exposing the class of things a man chooses or avoids. Those who aim at great deeds must also suffer greatly. Let your imagination release your imprisoned possibilities. We are able to ship worldwide Be thrifty, but not covetous. Go here and get it You are totally anonymous! I confess I enjoy democracy immensely. It is incomparably idiotic, and hence incomparably amusing. Epigrams succeed where epics fail. The only line I deleted was the one with the url... now tell me what this spam message was trying to sell!


I've been getting alot of spam recently that have images of the spam text itself, and include a "post-modern story of the day" at the bottom in plain text in order to trick/reduce effectiveness of Bayesian filters. Or some will do the same and just have a collection of random words (like "baseball sports espn issue meeting car subaru" etc).


It should be self-evident that this solution is not workable. Anything that requires this massive type of retooling of the whole method of using e-mail is doomed to failure. Any proposed solution cannot cause this type of massive interruption of normal e-mail usage.


"It should be self-evident that this solution is not workable. Anything that requires this massive type of retooling of the whole method of using e-mail is doomed to failure." This attitude is what keeps real solutions from occuring. SMTP/POP3 is antiquated, designed for a simpler time, and it needs replaced, period. If there were anything in its standards that could truly prevent spam, don't you think someone would have come up with it in the last 15 years? And so what if we have "interruption of normal e-mail usage" for a while? What do you think we have now? Millions of tiny "interruptions" bouncing around 24 hours a day. Slowing things down, wasting resources, wasting time, etc. These band-aid fixes are just that. They are not a solution. So I don't have to see the beastiality or xanax ads anymore, great. That doesn't mean they aren't still consuming mass resources in their continuous effort to reach me. "retooling of the whole method of using e-mail" is exactly what needs to happen, and not just because of the spam epidemic.


Although I don't think this article has the right solution, I don't see a problem with redesigning the email method. If a "spam-free" email exists in parallel with the email as we have it now, I will divert the spam-free mail to my inbox, and the spammy mail, through a filter, to a junk-suspect folder to be checked once a week. Of course, this spammy mail will get an auto-reply that tells the sender how to contact me using the spam-free protocol. After a while, I am certain the people I really want to hear from will all use the spam-free protocol, and I will stop checking the regular email, after the changing the auto-reply to "your mail has just been ignored". The key to get a massive retooling accepted is by using the original one in parallel. It will die off soon enough.


I don't care about the retooling. The core problem with this idea is the concept of handing out codes. Retraining users will be a pain in the ass. Not to mention just tracking the codes would suck. Say I meet someone in real life. How do I have a code to give them? Do I have to have a business card with codes on it? Do I need a computer just to exchange an email address? If you want to use codes or signatures, it has to be server side. That way a server could sign an email as valid. Then my email client could take that signature under advisement when the email actually arrives. The difference in action is an exercise left to the reader. However it's a lot nicer in the ease of use hurdles and doesn't screw old setups.


I wouldn't say he reinvented the whitelist. The whitelist is based on resending an e-mail after it's bounced back to the sender because it's an unrecognized e-mail address. This technique relies on something that's similar to public/private keys, with a dynamic code that helps detect true users from automated ones. My main gripe (that I just realized) is that some e-mail must be sent automatically, like web server confirmations. They would get sent into your "other" inbox with the thousands of spam messages if you lacked the person's "code".


Personally I really liked D. J. Bernstein's (qmail, djbdns, daemontools) idea for a new mail protocol. The big difference between it and mail we have now is that only the notification of mail is sent, not the mail itself. The mail sits on the sender's mailserver, waiting to be picked up, and if you want to retrieve it, your mail client does so from his server. Think about it - No more anonymous spam, since you KNOW where messages are coming from if you have to retrieve them. Therefore, if spam is illegal, we can punish them... and there is no more faking of where it's coming from. The other cool concept to that is mailing lists vs bandwidth. In old mailing list styles, a message would go out to the list, bouncing back from all people whose boxes are gone or full, with a lot of traffic. In DJ's new way, there is only notification of the message sent, and then only those who really want the message download it. The more you think about it, the better of an idea it becomes. In the world of terrifying ideas like "postage for emails" or "really super-mega-expensive domain names for mail only" Bernstein's has an elegance and practicality I haven't seen elsewhere.


"The big difference between it and mail we have now is that only the notification of mail is sent, not the mail itself." Options:

a) Notification contains no sender-modifiable content. No way to know if you want it or not. You say yes and wind up with spam from unknown server.
b) Notification winds up containing the entire spam as subject line, and the supposed server it's coming from doesn't exist.
c) Spammers break into millions of unsecured Windows boxes and run 'mail servers' on them.

Nice try, but no cigar.


I administer a mail server for a small ISP. The problem with filtering on the user's end is that my costs are consumed by the time the user deals with the spam. I don't think, as the article suggests, that spammers will slow down if their message is not being read, in fact they will just spew out ever more spam. If a 1/10 of 1% hit rate does not deter them, a smaller hit rate won't either. I have to put some upper limit to the amount of storage I can give each person (right now I allow 100M, which I think is quite reasonable). But if a user goes on vacation and does not check their e-mail for a month, they could have their inbox filled with spam and viruses (not much difference these days, from a server admin point of view). This will prevent legitimate messages from coming through. Therefore, I use the following technical measures to help reduce spam:

RBLs: dnsbl.njabl.org, sbl.spamhaus.org, xbl.spamhaus.org, and dul.dnsbl.sorbs.net
SPF:Sender (not adopted widely yet, but it does block a few messages a day even now)
Blocking specific subject lines (during virus outbreaks this can help)
Blocking mail "from" non-existent domains

I really have no choice, I cannot afford not to take these measures. I explain all of them to my clients, nobody has had a problem yet. These measures catch roughly 75% of spam and viruses, and as far as I know, no false positives.


A portion of my e-mail "Inbox" on 2004 March 29th as manifested by the "Microsoft Outlook Express 5" application. On this date I received 9 "legitimate" messages, 77 spam messages, and 2 virus attachments. And later: cpfahey@earthlink.net Outlook Express, public e-mail address and he is complaining about spam. Surprise, surprise!







CONTACT INFORMATION

Colin P. Fahey
Irvine, California; USA
cpfahey@earthlink.net

http://www.colinfahey.com