E-Mail Data Stores

Catherine Horiuchi. Encyclopedia of Database Technologies and Applications. Editor: Laura C Rivero. Idea Group Reference, 2006.


For many people, e-mail has become a running record of their business and personal lives. Somewhere in that big clot of e-mail messages that have accumulated over the years is a wealth of information about people they’ve met, work they’ve done, meetings they’ve held. There are tough calls and tender moments, great debates and funny episodes. When did you first meet a certain person? Just what was the initial offer in a business deal? What was that joke somebody sent you in 1998? The answers lie in your old e-mail. Unfortunately, it’s often easier to search the vast reaches of the World Wide Web than to quickly and accurately search your own stored e-mail. (Mossberg, 2004)

Electronic mail, or e-mail, has evolved from its beginnings as one of the earliest Internet applications. The network originally connected computers to computers, but in 1977, RFC 733 updated the messaging protocol to “focus on people and not mailboxes as recipients” (Crocker, Vittal, Pogran, & Henderson, 1977, p.1). Once considered a simple method to send text messages between two machines, e-mail has become a complex system of hardware and software interfaces between individuals and institutions. Messaging technologies manage the flow of major lines of business and significant public sector policy processes. As a corollary to this, databases associated with e-mail now rank among the most mission-critical data stores in many organizations.


User-oriented client software interfaces create flexibility to map messages in patterns strongly congruent with the way individuals think and organize information. This usability has resulted in e-mail becoming the catch basin of an organization’s intellectual capital and institutional history, particularly knowledge and activities that are not captured by software systems focused on basic business processes, such as inventory and accounts receivable. The message store grows organically over time. The central message store is linear in nature, with messages stored chronologically, but copies of those messages also are managed by each message originator and recipient. Multiple instances of a message are organized in individualistic fashion in multiple locations, based on each user’s business and personal priorities. Options also exist to create, send, and store messages in encrypted format. Firms face critical decisions regarding e-mail administration including policies on retention, software package management, and mitigation of risks from worms, viruses, and e-mail bombs. Administering e-mail is further complicated by the multiple parties who have the option to discard, retain, or forward an e-mail message, creating yet more copies: the sender, the recipient, the administrator of the originating mail system, and the administrator of the receiving mail system. Table 1 describes the basic location of e-mail messages.
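The proliferation of copies described above can be made concrete with a short sketch. It assumes each store can hand back raw messages as text and that copies of one message share a Message-ID header; the store names, the sample messages, and the function `group_copies` are hypothetical illustrations, not part of any particular mail product.

```python
from collections import defaultdict
from email import message_from_string


def group_copies(stores):
    """Map each Message-ID to the sorted names of the stores holding a copy.

    `stores` is a hypothetical dict: store name -> list of raw message strings.
    """
    locations = defaultdict(set)
    for store_name, raw_messages in stores.items():
        for raw in raw_messages:
            msg = message_from_string(raw)
            message_id = msg["Message-ID"]
            if message_id is None:
                continue  # nothing to match on; skip rather than guess
            locations[message_id.strip()].add(store_name)
    return {mid: sorted(names) for mid, names in locations.items()}


# Illustrative data only (assumed, not taken from the article).
OFFER = (
    "Message-ID: <1@example.com>\n"
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Initial offer\n"
    "\n"
    "Terms attached.\n"
)
JOKE = (
    "Message-ID: <2@example.com>\n"
    "From: c@example.com\n"
    "To: a@example.com\n"
    "Subject: Joke\n"
    "\n"
    "Knock knock.\n"
)

stores = {
    "central_server": [OFFER, JOKE],
    "sender_laptop": [OFFER],
    "recipient_laptop": [OFFER],
    "server_backup": [OFFER, JOKE],
}
copies = group_copies(stores)
for message_id, names in copies.items():
    print(message_id, len(names), names)
```

In this toy data a single message exists in four places, each under the control of a different party, which is the administrative difficulty the paragraph describes.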

The highly congruent, highly personal aspects of e-mail have contributed to efforts to capitalize on these attributes in an organized fashion. These efforts have varied in approach, and each faces specific challenges, discussed in the following section.

Data Management Challenges

Strategies to capture the knowledge held in e-mails have ranged from benign neglect (e.g., limited to backing up and archiving a central e-mail message store), to direct integration with a primary business application (e.g., using mail message protocols within a supply-chain software package, such as SAP), to sophisticated programming to capitalize on a particular e-mail platform (e.g., business application programming on Lotus Notes). Each approach is complicated by authentication, platform dependence, data corruptibility, and referential integrity issues.

The simplest strategy, benign neglect, is also the most common: The message store is backed up with the rest of the data on the server. If a user inadvertently deletes a message considered important, a request to the system administrator can result in it being restored from backup. This strategy can also meet legal requirements to retain e-mail if the system administrator has been notified of the requirement and has adequate storage capacity and processes in place. However, it is also dependent on the original sender/recipient to reestablish the connection between the particular e-mail message and its context among many issues and correspondents. And if the message was encrypted, loss of the original key will result in a need for a brute-force decryption, a time-consuming process. This simplest strategy also fails to address referential integrity, the principle that assures all copies of data are congruent and no part of the data loses its association with other elements. For instance, if an attachment is deleted, the message is incomplete; if an Internet site linked to a message is altered or expires, the message no longer retains the same meaning.
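The attachment example of referential integrity can be sketched as a simple check. This is a minimal illustration under assumed data structures: `StoredMessage`, `find_broken_references`, and the sample identifiers are hypothetical, and a real message store would track attachments differently.

```python
from dataclasses import dataclass, field


@dataclass
class StoredMessage:
    message_id: str
    attachment_ids: list = field(default_factory=list)


def find_broken_references(messages, attachment_store):
    """Return {message_id: [missing attachment ids]} for incomplete messages."""
    broken = {}
    for msg in messages:
        missing = [a for a in msg.attachment_ids if a not in attachment_store]
        if missing:
            broken[msg.message_id] = missing
    return broken


# Illustrative data only: "photo.jpg" has been deleted from the attachment store.
messages = [
    StoredMessage("<1@example.com>", ["contract.pdf"]),
    StoredMessage("<2@example.com>", ["photo.jpg", "notes.txt"]),
    StoredMessage("<3@example.com>"),
]
attachment_store = {"contract.pdf": b"%PDF", "notes.txt": b"agenda"}

broken = find_broken_references(messages, attachment_store)
print(broken)
```

A backup-only strategy never runs a check of this kind, so an incomplete message is discovered only when someone tries to use it.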

Table 1. Where e-mail data is stored

For each e-mail account, a message and any attachments must be stored somewhere until it is deleted. P. Hoffman (2004) offers a succinct list of common component terms:
• On individual, end-user systems, a common approach for POP3 users. Messages are copied from a mail server message store onto an individual’s computer. The messages are then deleted from the server.
• On servers, a common approach for IMAP users with web-based mail clients. Messages are stored only on the mail server.
Most mail systems are configured to store messages both on users’ machines and in a central database repository. Large message stores are usually included in system backups, resulting in further replication of messages and attachments.

Organizations with multiple hardware and software systems struggle to collect and analyze data at the organizational level (i.e., metadata analysis). To improve connections and reduce the time required for metadata analysis, firms may replace numerous free-standing applications with an enterprise resource planning (ERP) package, or combine data sources into a more integrated data warehouse (Zeng, Chang, & Yen, 2003). These ERP and data warehouse solutions include hooks for messaging technologies to establish links between departments and even external companies in supply-chain automation. This linking technology is compatible with several major e-mail engines, so firms can put the knowledge held in their existing e-mail systems to work in these specialized applications.

Combining a corporate-level e-mail system with a corporate-level business software package automates many manual processes and creates a strong audit trail. However, this creates a high degree of dependence on the technology companies that license the software products, as well as on any consultants who may have been hired to write specialized software routines and database extensions targeting particular business process automation. These types of technology enhancement projects easily run into the tens of millions of dollars for initial implementation and millions for annual maintenance. ERP packages and similar integrated software strategies address the referential integrity problem inherent in having multiple systems that describe various aspects of a single transaction. Instead of separate systems that catalog the purchase of a piece of equipment, its installation at a location, its maintenance schedule, depreciation, ultimate removal, and salvaging, a single system tags all these events to the equipment, resulting in a more comprehensive picture. Although this operational cohesion is of high value to management, a firm’s dependence on particular vendors results in loss of competitive pressure. Transitioning to an alternate vendor involves major expense, as does changing the integrated system to meet new business requirements. Historically, firms used software/hardware packages for decades, but that was before software tightly programmed employee behaviors that must shift with changing economic cycles and market challenges.

Authentication between major applications and external data stores can be handled in more than one way, with differing security profiles. The least satisfactory method, from a database administrator’s point of view, assigns administrative privileges to an application. The database administrator does not control rights of users in the application and cannot match data requests to users. It also exposes the data store to hacking exploits designed to enter the data store using other methods, such as a query engine inherent to the database management system. An alternative method, with named users and permissions at both the database and application level, may require users to authenticate multiple times. This basic dilemma underlies efforts for “single sign on.” In most instances, an end user has several separate identities established with the operating systems of multiple servers as well as applications, and a portion of their authentication has been established in a pass-through or permissions-table fashion. How well a firm maintains its active user directory determines how exposed its information is to security compromise.
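The traceability difference between the two authentication methods can be sketched with a toy audit log. Everything here is assumed for illustration: the account name `app_admin`, the log layout of (database user, query) pairs, and the function `attribute_requests` do not correspond to any specific database product.

```python
def attribute_requests(audit_log, app_account="app_admin"):
    """Split audit entries into those traceable to a named user and those
    issued under the shared application account (hypothetical log format)."""
    traceable, untraceable = [], []
    for db_user, query in audit_log:
        if db_user == app_account:
            untraceable.append((db_user, query))
        else:
            traceable.append((db_user, query))
    return traceable, untraceable


# Method 1 (assumed): the application connects with one administrative account.
shared_log = [
    ("app_admin", "SELECT * FROM messages WHERE owner = 'lee'"),
    ("app_admin", "DELETE FROM messages WHERE id = 42"),
]

# Method 2 (assumed): each person has a named account at the database level.
named_log = [
    ("lee", "SELECT * FROM messages WHERE owner = 'lee'"),
    ("kim", "DELETE FROM messages WHERE id = 42"),
]

shared_traceable, shared_untraceable = attribute_requests(shared_log)
named_traceable, named_untraceable = attribute_requests(named_log)
print(len(shared_untraceable), len(named_untraceable))
```

With the shared account, the database sees only the application; matching a request to a person then depends entirely on records the application chooses to keep.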

Rather than buying a software package and using the messaging software merely to route transactions, other firms have centered on the messaging itself and written extensive software enhancements to the basic message function. This is the Lotus Notes strategy. It is best suited for firms with substantial intellectual property, as opposed to firms with extensive inventory to track or manufacturing processes around which the concept of supply-chain management was originally developed. This strategy uses a technical staff with a deep understanding of both technology and the firm’s business. Its principal weakness is the required special programming for each integrated element, resulting in its attention primarily to high-value integration. Lower priority data will remain outside the data store, and any messaging regarding those materials will have no connection to the other systems that manage them.

Customer relationship management (CRM) software, while originating in call-center management, can be designed around multimodal communication with customers. The customer becomes the primary focus, whether contacted by phone, fax, e-mail, standard mail, or direct face-to-face meetings. Information about existing and potential customers is compounded into the data store (Bose & Sugumaran, 2003). This strategy combines the referential integrity of major software systems with the intellectual property management of a Lotus Notes model. Its primary weakness is its origin in telephone technologies and free-standing call-center business models. To the degree that firms contract out noncore operations, they potentially fragment the knowledge or open access to internal systems through very weak authentication protocols to partner firms.

McCray and Gallager (2001) catalog the following principles for designing, implementing, and maintaining a digital library: Expect change, know your content, involve the right people, be aware of proprietary data rights, automate whenever possible, adhere to standards, and be concerned about persistence. These library principles are equally relevant for designing, implementing, and maintaining an e-mail data store regardless of the knowledge-management strategy adopted. Applications that interact with messaging systems anticipate stability and persistence in the data store. Messages have mixed data types as attachments, users anticipate the messages will be retrievable to the limits of the organization’s persistence policy, and they must remain readable despite likely hardware and software changes.

An organization’s managers and staff can consider technical options and decide upon a workable information management strategy. This does not end their e-mail data-management considerations. The comprehensive information contained in metadata stores, and the persistence of the information, creates legal and ethical questions that must be considered. Legal matters, new privacy concerns, and historic fiduciary requirements result in a body of intergovernmental regulations to consider as well.

Legal and Ethical Questions

In the United States, existing e-mail messages are subject to discovery, a legal process to compel the exchange of information. Because of this risk, and to limit the amount of server storage space that must be managed, corporations may establish e-mail policies to routinely purge data stores. To the degree end users transfer messages to their local machines, these efforts can be confounded. Furthermore, e-mail stores are constrained legally to be simultaneously available and unavailable. For public agencies, the Freedom of Information Act requires accessibility, and the Privacy Act of 1974 requires that all personal information be protected (Enneking, 1998).
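A routine purge, and the way local copies confound it, can be sketched as follows. The 90-day window, the record layout, and the function names are assumptions chosen for illustration; an actual retention policy would be set by counsel and implemented in the mail platform itself.

```python
from datetime import datetime, timedelta


def purge_expired(store, now, retention_days):
    """Keep only messages received within the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [m for m in store if m["received"] >= cutoff]


def stranded_local_copies(central_after_purge, local_stores):
    """Return ids that survive on user machines but not on the server."""
    kept_ids = {m["id"] for m in central_after_purge}
    local_ids = {m["id"] for msgs in local_stores.values() for m in msgs}
    return sorted(local_ids - kept_ids)


# Illustrative data only.
now = datetime(2004, 6, 1)
central = [
    {"id": "a", "received": datetime(2004, 5, 20)},
    {"id": "b", "received": datetime(2003, 1, 10)},
]
local_stores = {"laptop_lee": [{"id": "b", "received": datetime(2003, 1, 10)}]}

central_after_purge = purge_expired(central, now, retention_days=90)
stranded = stranded_local_copies(central_after_purge, local_stores)
print([m["id"] for m in central_after_purge], stranded)
```

Message "b" is gone from the server yet still exists on a user's machine, so it remains discoverable even though the central purge ran as intended.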

Business trends and government rulemaking efforts have resulted in a technologically complex environment for e-mail data stores. Globalization (the economic and social changes related to penetration of linking technologies, such as computer networking, and the accompanying rise of multinational firms) creates exposure to multiple nation–state rules, though no international body enforces any constraints on global firms; enforcement remains state-specific. A plethora of international regulations on the use of encryption has affected vendors of e-mail technologies and created variations in allowable management of message stores. Case law and federal legislation have tried to reduce the quantity of unsolicited e-mail or “spam” that comprises a sizable percentage of all Internet messages. Regarding data replication, a ruling determined that the state’s privacy act was not violated when copies of e-mail messages were printed or stored as evidence (Washington State v. Townsend, 2001).

Lawyers have special interests in e-mail in terms of managing legal matters related to technology used by their clients and also within their own firms. Hopkins and Reynolds (2003) argued that ethical representation of a client should require informing the client of the value of an encrypted message, and the use of encryption for all communications, despite a 1999 American Bar Association position that stated lawyers did not violate rules of professional conduct in sending unencrypted e-mail. The capacity to easily encrypt (which is now built into most commercial e-mail products), combined with increases in cyber crime and the ease with which data packets can be intercepted, suggests a preference for encrypted e-mail as more secure than traditional mail or faxes, which are the customary methods for lawyer–client communication.

Inadvertent mistakes in using e-mail applications can result in erroneous transmissions of what in other media might be considered confidential and privileged information. For instance, pressing “reply all” instead of “reply” might send information to parties outside a privileged contact (Hall & Estrella, 2000). Judicial and ethical opinion is divided on whether the act of sending the e-mail—even inadvertently—waives attorney-client privilege, resulting in each situation being evaluated individually, according to the jurisdiction involved. For the most part, the receiving party is able to review and retain these materials for use. Table 2 cites instances of these mistakes.

Emerging communication technologies, such as instant messaging and third-party email servers, create new legal challenges for firms trying to manage their information (Juhnke, 2003).

Future Trends

A degree of stability is created by implementing standard records management processes and using application or platform-specific archiving methods for e-mail data stores. But messaging data management is complicated by adoption of new communication media, such as instant messaging (IM) and Webcasting. In IM or chat mode, a user is online in real time with another user at another computer, both network connected. Originally, chat sessions were not captured and stored, but were transient, similar to most telephone calls. And similar to a modern telephone conversation with a customer service representative, IM sessions are now considered a part of the normal business communications to be captured, stored, indexed, and analyzed.

The U.S. government requires financial firms to be able to turn over IM logs within 24 hours of a request, just as with e-mail. Failing to meet this requirement can be costly: The Securities and Exchange Commission fined five firms $8.25 million in 2002 for not preserving e-mail and making it available as required (T. Hoffman, 2004).

No straightforward method exists to manage multiple data stores created on separate devices and managed through multiple proprietary networks. Basic individual e-mail stores on computers can generally be sorted into folders, clustering related ideas on a topic. Store-and-forward mail servers catalog details on recipients of e-mail. Instant messaging is session-oriented and transitory. Although some large, technology-oriented firms store these sessions for replay, they are not so easily indexed and searched, much less integrated into a company’s metadata repository. Rapid adoption of new communication technologies results in periodic stranding of data and loss of business intelligence.


Over a quarter century has passed since the Internet messaging protocols defining e-mail message transfers were adopted, expanding communication options. E-mail and the enhancement of IM are simple technologies easily mastered and congruent with human psychology. Organizations can capture these interactions, resulting in massive data stores with a rich mix of data types in attachments. Their importance is evident in efforts to incorporate the knowledge collected in these communication media within ERP and CRM installations.

Legal and ethical constraints may exceed the technical and sociologic challenges. To create a framework that can succeed, organizations must include in their business strategies routine assessments of success factors in their adoption and adaptation of proprietary messaging and application software. High failure rates exist for many of these projects. Emerging case law regarding e-mail increases the complexity in maximal use of message stores. Proliferation of messages and limited manageability are side effects of these policies.

These attributes of metadata creation and analysis favor the largest firms with the most sophisticated staff and data-management systems. The smallest, nimblest firms can adopt any emerging trend in information, and must do so to create a strategic niche. Once a trend persists, these small firms can be acquired and incrementally their new data media integrated into the larger firm. So it appears that form and function are indeed merging between technology and process.

Table 2. Most infamous e-mail faux pas

Learning the hard way that “DELETE” doesn’t necessarily get rid of e-mail messages. By 1986, the entire White House was using IBM’s PROFS (“professional office system”) e-mail software. In late November of that year, Oliver North and John Poindexter deleted thousands of e-mail messages as the Iran-Contra scandal broke. The system backups were not destroyed, however, and the Tower Commission was able to obtain this material in its hearings on unauthorized arms-for-hostages deals (Blanton, 1995).
The Morris Worm. The first Internet worm (the Morris worm) was launched November 2, 1988. It exploited a hole in UNIX’s sendmail routine. The spread of the worm was an unintended result of a flaw in Morris’s code (Hafner & Markoff, 1995).