Jack Cook. Information Security and Ethics: Concepts, Methodologies, Tools, and Applications. Editor: Hamid Nemati, Volume 3, Information Science Reference, 2008.
Introduction
Decision makers thirst for answers to questions. As more data is gathered, more questions are posed: Which customers are most likely to respond positively to a marketing campaign, product price change, or new product offering? How will the competition react? Which loan applicants are most likely or least likely to default? The ability to raise questions, even those that currently cannot be answered, is a characteristic of a good decision maker. Decision makers no longer have the luxury of making decisions based on gut feeling or intuition. Decisions must be supported by data; otherwise, decision makers can expect to be questioned by stockholders, reporters, or attorneys in a court of law. Data mining can support, and often direct, decision makers in ways that are counterintuitive. Although data mining can provide considerable insight, there is an “inherent risk that what might be inferred may be private or ethically sensitive” (Fule & Roddick, 2004, p. 159).
Extensively used in telecommunications, financial services, insurance, customer relationship management (CRM), retail, and utilities, data mining more recently has been used by educators, government officials, intelligence agencies, and law enforcement. It helps alleviate data overload by extracting value from volume. However, data analysis is not data mining. Query-driven data analysis, perhaps guided by an idea or hypothesis, that tries to deduce a pattern, verify a hypothesis, or generalize information in order to predict future behavior is not data mining (Edelstein, 2003). It may be a first step, but it is not data mining. Data mining is the process of discovering and interpreting meaningful, previously hidden patterns in the data. It is not a set of descriptive statistics. Description is not prediction. Furthermore, the focus of data mining is on the process, not a particular technique, used to make reasonably accurate predictions. It is iterative in nature and generically can be decomposed into the following steps: (1) data acquisition through translating, cleansing, and transforming data from numerous sources, (2) goal setting or hypotheses construction, (3) data mining, and (4) validating or interpreting results.
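The iterative, four-step process described above can be sketched in code. The following is a minimal, hypothetical illustration; all function names, field names, and data are invented for the example and are not drawn from the source.

```python
# A toy sketch of the four-step, iterative data mining process:
# (1) acquire/cleanse, (2) set goals, (3) mine, (4) validate.
from collections import Counter

def acquire(raw_sources):
    """Step 1: translate, cleanse, and transform data from numerous sources."""
    records = []
    for source in raw_sources:
        for row in source:
            # Normalize keys and drop missing values so sources line up.
            records.append({k.strip().lower(): v for k, v in row.items()
                            if v is not None})
    return records

def mine(records, target_field):
    """Step 3: surface frequency patterns across the merged sources."""
    return Counter(r[target_field] for r in records if target_field in r)

def validate(patterns, holdout, target_field):
    """Step 4: do the mined patterns persist in data held back from mining?"""
    return (patterns.most_common(1)[0][0]
            == mine(holdout, target_field).most_common(1)[0][0])

# Step 2, goal setting, is a human activity; here the hypothesis is simply
# "which customer segment responds most often?" (illustrative data only)
source_a = [{"Segment": "retail", "responded": True},
            {"Segment": "retail", "responded": False}]
source_b = [{"segment": "utility", "responded": True}]

records = acquire([source_a, source_b])
patterns = mine(records, "segment")
ok = validate(patterns, [{"segment": "retail", "responded": True}], "segment")
```

In practice each step would be far more involved, and the process would loop: validation failures send the miner back to data preparation or goal setting.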
The process of generating rules through a mining operation becomes an ethical issue when the results are used in decision-making processes that affect people or when mining customer data unwittingly compromises the privacy of those customers (Fule & Roddick, 2004). Data miners and decision makers must contemplate ethical issues before encountering them. Otherwise, they risk failing to recognize when a dilemma exists or making poor choices, since all aspects of the problem have not been identified.
Background
Technology has moral properties, just as it has political properties (Brey, 2000; Feenberg, 1999; Sclove, 1995; Winner, 1980). Winner (1980) argues that technological artifacts and systems function like laws, serving as frameworks for public order by constraining individuals’ behaviors. Sclove (1995) argues that technologies possess the same kinds of structural effects as other elements of society, such as laws, dominant political and economic institutions, and systems of cultural beliefs. Data mining, being a technological artifact, is worthy of study from an ethical perspective due to its increasing importance in decision making, both in the private and public sectors. Computer systems often function less as background technologies and more as active constituents in shaping society (Brey, 2000). Data mining is no exception. Higher integration of data mining capabilities within applications ensures that this particular technological artifact will increasingly shape public and private policies.
Data miners and decision makers obviously are obligated to adhere to the law. But ethics are oftentimes more restrictive than what is called for by law. Ethics are standards of conduct that are agreed upon by cultures and organizations. Supreme Court Justice Potter Stewart defined the difference between ethics and laws as knowing the difference between what you have a right to do (legally, that is) and what is right to do. Sadly, a number of IS professionals either lack an awareness of what their company actually does with data and data mining results or purposely conclude that it is not their concern. They become enablers: they solve management’s problems and treat what management does with the data or results as someone else’s concern.
Most laws do not explicitly address data mining, although court cases are being brought to stop certain data mining practices. A federal court ruled that using data mining tools to search Internet sites for competitive information may be a crime under certain circumstances (Scott, 2002). In EF Cultural Travel BV v. Explorica Inc. (No. 01-2000, 1st Cir., Dec. 17, 2001), the First Circuit Court of Appeals in Massachusetts held that Explorica, a tour operator for students, improperly obtained confidential information about how rival EF’s Web site worked and used that information to write software that gleaned data about student tour prices from EF’s Web site in order to undercut EF’s prices (Scott, 2002). In this case, Explorica probably violated the federal Computer Fraud and Abuse Act (18 U.S.C. Sec. 1030). Hence, the source of the data is important when data mining.
Typically, with applied ethics, a morally controversial practice, such as how data mining impacts privacy, “is described and analyzed in descriptive terms, and finally moral principles and judgments are applied to it and moral deliberation takes place, resulting in a moral evaluation, and operationally, a set of policy recommendations” (Brey, 2000, p. 10). Applied ethics is adopted by most of the literature on computer ethics (Brey, 2000). Data mining may appear to be morally neutral, but appearances in this case are deceiving. This paper takes an applied perspective to the ethical dilemmas that arise from the application of data mining in specific circumstances as opposed to examining the technological artifacts (i.e., the specific software and how it generates inferences and predictions) used by data miners.
Main Thrust
Computer technology has redefined the boundary between public and private information, making much more information public. Privacy is the freedom granted to individuals to control their exposure to others. A customary distinction is between relational and informational privacy. Relational privacy is the control over one’s person and one’s personal environment, and concerns the freedom to be left alone without observation or interference by others. Informational privacy is one’s control over personal information in the form of text, pictures, recordings, and so forth (Brey, 2000).
Technology cannot be separated from its uses. Any information systems (IS) professional who learns, through whatever means, that data he or she has been asked to gather or mine will be used unethically has an ethical obligation to act in a socially and ethically responsible manner. This might mean nothing more than pointing out why such a use is unethical. In other cases, more extreme measures may be warranted. As data mining becomes more commonplace and as companies push for even greater profits and market share, ethical dilemmas will be increasingly encountered. Ten common blunders that a data miner may commit, resulting in potential ethical or possibly legal dilemmas, are (Skalak, 2001):
- Selecting the wrong problem for data mining.
- Ignoring what the sponsor thinks data mining is and what it can and cannot do.
- Leaving insufficient time for data preparation.
- Looking only at aggregated results, never at individual records.
- Being nonchalant about keeping track of the mining procedure and results.
- Ignoring suspicious findings in the haste to move on.
- Running mining algorithms repeatedly without thinking hard enough about the next stages of the data analysis.
- Believing everything you are told about the data.
- Believing everything you are told about your own data mining analyses.
- Measuring results differently from the way the sponsor will measure them.
These blunders are hidden ethical dilemmas faced by those who perform data mining. In the next subsections, sample ethical dilemmas raised with respect to the application of data mining results in the public sector are examined, followed briefly by those in the private sector.
Ethics of Data Mining in the Public Sector
Many times, the objective of data mining is to build a customer profile based on two types of data—factual (who the customer is) and transactional (what the customer does) (Adomavicius & Tuzhilin, 2001). Often, consumers object to transactional analysis. What follows are two examples; the first (identifying successful students) creates a profile based primarily on factual data, and the second (identifying criminals and terrorists) primarily on transactional.
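The factual/transactional distinction above can be made concrete with a small data structure. This is a hypothetical sketch; the field names and sample values are illustrative assumptions, not taken from Adomavicius and Tuzhilin.

```python
# Hypothetical customer profile combining the two data types described above:
# factual (who the customer is) and transactional (what the customer does).
from dataclasses import dataclass, field
from typing import List

@dataclass
class FactualProfile:
    customer_id: str
    age_band: str          # e.g. "25-34" (illustrative)
    household_income: str  # e.g. "$75k-$125k" (illustrative)

@dataclass
class Transaction:
    sku: str
    amount: float

@dataclass
class CustomerProfile:
    factual: FactualProfile
    transactions: List[Transaction] = field(default_factory=list)

    def total_spend(self) -> float:
        # Transactional analysis: behavior derived from what the customer
        # does, which is the component consumers most often object to.
        return sum(t.amount for t in self.transactions)

profile = CustomerProfile(
    factual=FactualProfile("c-001", "25-34", "$75k-$125k"),
    transactions=[Transaction("sku-1", 19.99), Transaction("sku-2", 5.00)],
)
```

The ethical weight differs by component: factual fields are largely static and often volunteered, while the transaction list grows silently with every interaction.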
Identifying Successful Students
Probably the most common and well-developed use of data mining is the attraction and retention of customers. At first, this sounds like an ethically neutral application. Why not apply the concept of students as customers to the academe? When students enter college, the transition from high school for many students is overwhelming, negatively impacting their academic performance. High school is a highly structured Monday-through-Friday schedule. College requires students to study at irregular hours that constantly change from week to week, depending on the workload at that particular point in the course. Course materials are covered at a faster pace; the duration of a single class period is longer; and subjects are often more difficult. Tackling the changes in a student’s academic environment and living arrangement as well as developing new interpersonal relationships is daunting for students. Identifying students prone to difficulties and intervening early with support services could significantly improve student success and, ultimately, improve retention and graduation rates.
Consider the following scenario that realistically could arise at many institutions of higher education. Admissions at the institute has been charged with seeking applicants who are more likely to be successful (i.e., graduate from the institute within a five-year period). Someone suggests data mining existing student records to determine the profile of the most likely successful student applicant. With little more than this loose definition of success, a great deal of disparate data is gathered and eventually mined. The results indicate that the most likely successful applicant, based on factual data, is an Asian female whose family’s household income is between $75,000 and $125,000 and who graduates in the top 25% of her high school class. Based on this result, admissions chooses to target market such high school students. Is there an ethical dilemma? What about diversity? What percentage of limited marketing funds should be allocated to this customer segment? This scenario highlights the importance of having well-defined goals before beginning the data mining process. The results would have been different if the goal were to find the most diverse student population that achieved a certain graduation rate after five years. In this case, the process was flawed fundamentally and ethically from the beginning.
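How strongly the mining goal shapes the outcome can be shown with a toy computation. The records below are invented for illustration and do not represent any real student data; the two "goals" are loose stand-ins for the scenario's competing definitions of success.

```python
# Toy illustration: the same records mined under two different goals
# yield very different "target" populations.
from collections import Counter

students = [
    # (gender, income_band, top25_hs, graduated_in_5yr) -- invented data
    ("F", "75-125k", True,  True),
    ("F", "75-125k", True,  True),
    ("M", "<75k",    False, True),
    ("F", "<75k",    True,  True),
    ("M", ">125k",   False, False),
    ("M", "75-125k", False, False),
]

# Goal 1: the single profile most frequent among graduates.
grads = [s[:3] for s in students if s[3]]
best_profile = Counter(grads).most_common(1)[0][0]
# Marketing only to this one segment excludes other groups that also graduate.

# Goal 2: the most diverse pool that still meets the graduation criterion --
# here, every distinct profile whose members graduated in this toy data.
diverse_pool = {s[:3] for s in students if s[3]}
```

In this toy data, Goal 1 collapses recruiting to one segment while Goal 2 retains three distinct profiles with the same graduation outcome, which is exactly why the loosely defined goal in the scenario made the process flawed from the beginning.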
Identifying Criminals and Terrorists
The key to the prevention, investigation, and prosecution of criminals and terrorists is information, often based on transactional data. Hence, government agencies increasingly desire to collect, analyze, and share information about citizens and aliens. However, according to Rep. Curt Weldon (R-PA), chairman of the House Subcommittee on Military Research and Development, there are 33 classified agency systems in the federal government, but none of them link their raw data together (Verton, 2002). As Steve Cooper, CIO of the Office of Homeland Security, said, “I haven’t seen a federal agency yet whose charter includes collaboration with other federal agencies” (Verton, 2002, p. 5). Weldon lambasted the federal government for failing to act on critical data mining and integration proposals that had been authored before the terrorist attacks on September 11, 2001 (Verton, 2002).
Data to be mined is obtained from a number of sources. Some of these are relatively new and unstructured in nature, such as help desk tickets, customer service complaints, and complex Web searches. In other circumstances, data miners must draw from a large number of sources. For example, the following databases represent some of those used by the U.S. Immigration and Naturalization Service (INS) to capture information on aliens (Verton, 2002).
- Employment Authorization Document System
- Marriage Fraud Amendment System
- Deportable Alien Control System
- Reengineered Naturalization Application Casework System
- Refugees, Asylum, and Parole System
- Integrated Card Production System
- Global Enrollment System
- Arrival Departure Information System
- Enforcement Case Tracking System
- Student and Schools System
- General Counsel Electronic Management System
- Student Exchange Visitor Information System
- Asylum Prescreening System
- Computer-Linked Application Information Management System (two versions)
- Non-Immigrant Information System
There are islands of excellence within the public sector. One such example is the U.S. Army’s Land Information Warfare Activity (LIWA), which is credited with “having one of the most effective operations for mining publicly available information in the intelligence community” (Verton, 2002, p. 5).
Businesses have long used data mining. However, recently, governmental agencies have shown growing interest in using “data mining in national security initiatives” (Carlson, 2003, p. 28). Two government data mining projects, the latter renamed by the euphemism “factual data analysis,” have been under scrutiny (Carlson, 2003). These projects are the U.S. Transportation Security Administration’s (TSA) Computer Assisted Passenger Prescreening System II (CAPPS II) and the Defense Advanced Research Projects Agency’s (DARPA) Total Information Awareness (TIA) research project (Gross, 2003). TSA’s CAPPS II will analyze the name, address, phone number, and birth date of airline passengers in an effort to detect terrorists (Gross, 2003). James Loy, director of the TSA, stated to Congress that, with CAPPS II, the percentage of airplane travelers going through extra screening is expected to drop significantly from the 15% who undergo it today (Carlson, 2003). Decreasing the number of false positive identifications will shorten lines at airports.
TIA, on the other hand, is a set of tools to assist agencies such as the FBI with data mining. It is designed to detect extremely rare patterns. The program will include terrorism scenarios based on previous attacks, intelligence analysis, “war games in which clever people imagine ways to attack the United States and its deployed forces,” testified Anthony Tether, director of DARPA, to Congress (Carlson, 2003, p. 22). When asked how DARPA will ensure that personal information caught in TIA’s net is correct, Tether stated that “we’re not the people who collect the data. We’re the people who supply the analytical tools to the people who collect the data” (Gross, 2003, p. 18). “Critics of data mining say that while the technology is guaranteed to invade personal privacy, it is not certain to enhance national security. Terrorists do not operate under discernable patterns, critics say, and therefore the technology will likely be targeted primarily at innocent people” (Carlson, 2003, p. 22). Congress voted to block funding of TIA. But privacy advocates are concerned that the TIA architecture, dubbed “mass dataveillance,” may be used as a model for other programs (Carlson, 2003).
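The critics' point that screening for extremely rare patterns mostly flags innocent people follows from simple base-rate arithmetic. The numbers below are purely illustrative assumptions, not figures from TSA, DARPA, or the cited sources.

```python
# Back-of-envelope illustration (invented numbers): when the condition being
# screened for is extremely rare, almost everyone flagged is a false positive,
# no matter how accurate the screen is.

def flagged_innocent_fraction(prevalence, sensitivity, false_positive_rate):
    """Fraction of flagged people who are actually innocent (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * false_positive_rate
    return false_pos / (true_pos + false_pos)

# Assume (illustratively) 1 target per 10 million passengers, a 99%-sensitive
# screen, and a 1% false positive rate.
frac = flagged_innocent_fraction(1e-7, 0.99, 0.01)
# Under these assumptions, well over 99.9% of flagged passengers are innocent.
```

This is why shrinking the false positive rate, as CAPPS II promises, matters far more for the flagged population than improving sensitivity does.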
Systems such as TIA and CAPPS II raise a number of ethical concerns, as evidenced by the overwhelming opposition to these systems. One system, the Multistate Anti-TeRrorism Information EXchange (MATRIX), illustrates the poor reputation data mining has acquired in the public sector. MATRIX is self-defined as “a pilot effort to increase and enhance the exchange of sensitive terrorism and other criminal activity information between local, state, and federal law enforcement agencies” (matrix-at.org, accessed June 27, 2004). Interestingly, MATRIX states explicitly on its Web site that it is not a data-mining application, although the American Civil Liberties Union (ACLU) openly disagrees. At the very least, the public is so concerned about the perceived opportunity for ethical dilemmas and, ultimately, abuse that the project felt the disclaimer was needed. Due to the extensive writings on data mining in the private sector, the next subsection is brief.
Ethics of Data Mining in the Private Sector
Businesses discriminate constantly. Customers are classified, receiving different services or different cost structures. As long as discrimination is not based on protected characteristics such as age, race, or gender, discriminating is legal. Technological advances make it possible to track in great detail what a person does. Michael Turner, executive director of the Information Services Executive Council, states, “For instance, detailed consumer information lets apparel retailers market their products to consumers with more precision. But if privacy rules impose restrictions and barriers to data collection, those limitations could increase the prices consumers pay when they buy from catalog or online apparel retailers by 3.5% to 11%” (Thibodeau, 2001, p. 36). Obviously, if retailers cannot target their advertising, then their only option is to mass advertise, which drives up costs.
With this profile of personal details comes a substantial ethical obligation to safeguard this data. Ignoring any legal ramifications, the ethical responsibility is placed firmly on IS professionals and businesses, whether they like it or not; otherwise, they risk lawsuits and harming individuals. “The data industry has come under harsh review. There is a raft of federal and local laws under consideration to control the collection, sale, and use of data. American companies have yet to match the tougher privacy regulations already in place in Europe, while personal and class-action litigation against businesses over data privacy issues is increasing” (Wilder & Soat, 2001, p. 38).
Future Trends
Data mining traditionally was performed by a trained specialist, using a stand-alone package. This once nascent technique is now being integrated into an increasing number of broader business applications and legacy systems used by those with little formal training, if any, in statistics and other related disciplines. Only recently have privacy and data mining been addressed together, as evidenced by the fact that the first workshop on the subject was held in 2002 (Clifton & Estivill-Castro, 2002). The challenge of ensuring that data mining is used in an ethically and socially responsible manner will increase dramatically.
Conclusion
Several lessons should be learned. First, decision makers must understand key strategic issues. The data miner must have an honest and frank dialog with the sponsor concerning objectives. Second, decision makers must not come to rely on data mining to make decisions for them. Even the best data mining is subject to human interpretation. Third, decision makers must be careful not to use intuition to explain away data mining results that are counterintuitive. Decision making inherently creates ethical dilemmas, and data mining is but a tool to assist management in key decisions.