More Precious than Gold?
My goal here is to briefly discuss what is meant by the phrase data mining and what mathematical tools and ideas have been brought to bear in trying to make progress in this field. ...
Gold has been much admired by human societies at different times and in different places. In 2011 the price of gold (love that data) on some markets exceeded $1900 an ounce. Black gold has been a metaphor for the importance and value of oil (currently over $100 a barrel) for many years now. But could it be that it will be data that will be the really valuable commodity of the future? And if so, what does that have to do with mathematics?
Data, Data, Everywhere
What is "data"? Data is a term used in a variety of ways, most commonly as "facts" or statistics in numerical form. However, sometimes the term is used to mean information that is going to be analyzed or used to make some decision. Another relatively recent "definition" of data involves the numbers, symbols, strings and tables (matrices) that are manipulated or analyzed with a computer.
Understanding data, in the sense of seeing patterns that are present and not due to "noise" or chance measurement errors, requires an understanding of probability theory. Probability theory and statistics are in many ways subjects that go hand in hand. However, probability is an intrinsically difficult and subtle subject. While the mathematics of probability theory is now on a firm foundation, mathematical modeling with probability, that is, using probability in subjects outside of mathematics, is very complicated. When one makes the statement that a particular physical coin C has a slight bias towards heads, so that the probability of a head is .501 while the probability of a tail is .499, how does one interpret this statement? If one tosses C over and over again in a way where the physical tossing of the coin does not affect whether it shows heads or tails (skilled individuals can take a 2-sided coin and, by using the same very precise tossing motion, do so in a way that heads will come up all the time), one will get a pattern of H's (for heads) and T's (for tails). For example, I tossed a coin 10 times and got this pattern:
(Photo of Leonard Savage)
Part of the issue is to use common language (whether in English, French, etc.) to express the different environments in which situations come up which involve "noise," "randomness," "chance," or something "unexpected." The nature of radioactive decay has a very different character from where the next tornado might hit in a particular state, or what is the chance that it will hit a particular town on a particular day.
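To make the earlier statement about the biased coin C concrete under the relative frequency reading, here is a minimal simulation sketch. It assumes an idealized coin whose tosses are independent with a fixed head probability of .501; the function name `running_head_frequency`, the seed, and the number of tosses are illustrative choices, not part of the original discussion.

```python
import random


def running_head_frequency(p_head: float, n_tosses: int, seed: int = 0) -> list[float]:
    """Simulate independent tosses and return the fraction of heads after each toss."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = 0
    freqs = []
    for i in range(1, n_tosses + 1):
        if rng.random() < p_head:  # a head occurs with probability p_head
            heads += 1
        freqs.append(heads / i)
    return freqs


# Hypothetical experiment: toss the slightly biased coin 200,000 times.
freqs = running_head_frequency(0.501, 200_000)
for n in (10, 1_000, 200_000):
    print(n, round(freqs[n - 1], 4))
```

The fraction of heads fluctuates by roughly 0.5 divided by the square root of the number of tosses, so telling .501 apart from .5 reliably takes on the order of a million tosses. That is one reason a statement such as "the probability of a head is .501" is hard to check against a physical coin.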
Pioneers of probability and statistics
It is very rare in mathematics that important ideas come out of nowhere and have no antecedents. Many individuals from many countries have contributed to probability and statistics. Here is a small sample of such important contributors. Certainly one of the early pioneers of the relative frequency approach to probability theory was the French philosopher and mathematician Blaise Pascal (1623-1662). Pascal developed his ideas in the context of assisting with insight into the games of chance that gamblers practice.
(Portrait of Blaise Pascal)
A major step forward in probability theory was taken by Thomas Bayes (1702-1761). Bayes was an English minister who also did work in mathematics.
(Portrait of Thomas Bayes)
Bayes' famous result involves the notion of conditional probability for outcomes, whether these are observed in the world (an experiment) or are the results of some hypothetical experiment. Such outcomes are referred to as events. When one tosses a fair coin C (by which I will mean that the stabilized relative frequency of heads and of tails is each 1/2) ten times, one can ask for the probability of the event that exactly 7 of the 10 tosses resulted in a tail, or of a different event, that at least 7 of the 10 tosses resulted in a head.
Suppose that the probability of having a boy child or a girl child is 1/2, and births are independent of each other (which loosely speaking means that the outcome for one child does not affect the outcome for other children). We can ask: What is the probability that a couple's next two children will be girls? If an ordered string GGB means first child a girl, second child a girl, and third child a boy, we can denote the required probability as P(GG). We can also use subscripts to indicate birth order: B1, G2, and G1B2 would denote the three events first child a boy; second child a girl; and first child a girl and second child a boy, respectively.
Now suppose that we know that the first child turned out to be a girl. We can now ask: what is the probability that both children will be girls? We are asking for the "conditional probability" of having two girls given that the first child was a girl. More generally, we can write: P (X | Y) = probability that event X occurs given that event Y occurred. We can also write P ( Y | X ) = probability of event Y given that X has occurred. It is not difficult to see on intuitive grounds (relative frequency interpretation) that, provided P(Y) is not zero:
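The questions posed above can be checked by direct computation. The sketch below assumes that the relation intended at the end of the preceding paragraph is the standard definition of conditional probability, P(X | Y) = P(X and Y) / P(Y), from which Bayes' rule P(X | Y) = P(Y | X) P(X) / P(Y) follows; that reading is an assumption. The helper names `binomial_pmf` and `prob` are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb


def binomial_pmf(n: int, k: int, p: Fraction) -> Fraction:
    """Probability of exactly k successes in n independent trials, each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


half = Fraction(1, 2)

# Fair coin tossed ten times.
p_exactly_7_tails = binomial_pmf(10, 7, half)
p_at_least_7_heads = sum(binomial_pmf(10, k, half) for k in range(7, 11))

# Two independent births: the four ordered outcomes BB, BG, GB, GG are equally likely.
outcomes = ["".join(pair) for pair in product("BG", repeat=2)]
weight = Fraction(1, len(outcomes))


def prob(event) -> Fraction:
    """Total probability of the outcomes for which event(outcome) is true."""
    return sum(weight for o in outcomes if event(o))


p_gg = prob(lambda o: o == "GG")        # P(GG)
p_g1 = prob(lambda o: o[0] == "G")      # P(G1): first child a girl
# Assumed definition: P(X | Y) = P(X and Y) / P(Y), with P(Y) not zero.
p_gg_given_g1 = prob(lambda o: o == "GG" and o[0] == "G") / p_g1
# Bayes' rule gives the same value: P(GG | G1) = P(G1 | GG) P(GG) / P(G1), where P(G1 | GG) = 1.
p_gg_given_g1_bayes = Fraction(1) * p_gg / p_g1

print(p_exactly_7_tails, p_at_least_7_heads, p_gg, p_gg_given_g1, p_gg_given_g1_bayes)
```

Exact fractions are used instead of floating point so that the results can be compared with hand calculation: knowing that the first child is a girl raises the probability of two girls from 1/4 to 1/2.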
(Portrait of Adolphe Quetelet)
In more modern times a growing number of individuals have made contributions to statistics and its interface with probability theory. My purpose here is to show not only that people from many backgrounds and countries have contributed to our current richer understanding of statistics and probability but also how much of what is now known is of quite recent origin. To do proper justice to this topic would require a book-length treatment.
Like all of the parts of mathematics, when examined "under a microscope," statistics has had a rich and complex history with contributions from many people who thought of themselves as mathematicians but also were in many cases from other areas of intellectual endeavor. John Maynard Keynes (1883-1946) is best known for his work in economics but he wrote an important book on probability theory in 1921 after having studied, among other things, mathematics at Cambridge University. Keynes' work was looked at by Lord Russell and also came to the attention of Frank Ramsey who, in addition to his famous work in combinatorics (what is today called Ramsey's Theorem), did important work on the mathematical and philosophical foundations of probability theory. While initially Keynes' view of probability tended towards the stabilized relative frequency approach, as time went on he moved towards a more subjective--intensity of belief--system, possibly because in economics that has a natural appeal.
(Photo of John Maynard Keynes)
Ronald Fisher (1890-1962) was another British mathematician educated at Cambridge. He is known for a large variety of accomplishments in statistics. So-called F-tests are named for him. Fisher also had an interest in experimental design. He worked for many years at the Agricultural Experiment Station in Rothamsted, England. There he was involved with procedures that could establish how different "treatments" affected plants. A treatment could be a watering regime, a different type of soil, or a fertilizer regimen. The idea was to use statistical analysis to sort out the effects of different types of treatments on different types of plants, say, to increase the yield of a food plant. Ironically, and sadly, at one point in their careers Fisher and Pearson were involved in a heated dispute about Fisher's statistical ideas and methods.
(Photo of Ronald Fisher)
Jerzy Neyman (1894-1981) was born in Russia but ended his career in the United States. Along the way he lived in London for some time where he interacted with Egon Pearson (1895-1980), the son of Karl Pearson, who was also a statistician. In the United States, Neyman had a position at the University of California at Berkeley, where he helped make the Statistics Department world famous.
(Photo of Jerzy Neyman)
Bruno de Finetti (1906-1985) was born in Austria but educated in Italy and died in Rome. He became known for promoting a "subjective" view of the meaning of probabilities.
(Photo of Bruno de Finetti)
As the power of the digital computer progressed, mathematicians began to take advantage of that power to explore and draw implications from data. John Tukey (1915-2000), who coined the term "bit" for a binary digit, was a pioneer in the field of exploratory data analysis. He started out as a student of chemistry, not mathematics, but eventually earned a doctorate in mathematics from Princeton University. Tukey worked for many years for Bell Laboratories. While at Bell Labs, where he reached the rank of Associate Director, Tukey was involved with finding ways to exploit the growing power, increased memory, and greater speed of computers to get as much information as possible from data. Bell Labs, and Tukey, helped develop innovative ways to display data sets. The human eye is exquisitely sensitive to visual patterns, so Tukey and others explored how to display data in a way that uses the human visual system to get insight from the data. Tukey, together with James Cooley, was also responsible for important work in signal processing via their work on the Fast Fourier Transform (1965).
(Photo of John Tukey)
It is sometimes claimed that we are now "drowning" in data. This notion has come about in part because there is so much data being generated, collected, and stored that most of us don't have the time to look at it all, much less think about the implications of the data. While in some sense all aspects of 21st century American life are becoming data driven, it may be helpful to just list a few of the areas that are "data intensive."
Tools of data mining
The scaffolding that surrounds data mining is the mathematics of statistics, enriched with ideas from artificial intelligence and computational learning theory.
More recently, IBM also designed a computer system called Watson that beat the "best" human opponents at Jeopardy, a game which involves a complex mixture of factual recall in an environment of linguistic playfulness. A contestant gets to choose a topic with a particular number of points, where the difficulty of the question changes with the point value assigned. From time to time the contestant may pick a question which, if answered correctly, will double the amount of money received. Sometimes when a question is chosen in a particular category the contestant gets to pick what part of his/her/its current earnings to "wager" on getting the right answer. There is also a "Final Jeopardy" round where one can try to overtake an opponent by betting a large part of one's current winnings and giving a correct response. On February 14-16, 2011, in a special series of TV shows that was screened in prime time, the IBM "system" known as Watson beat two very impressive human opponents, despite a few strange bits of "behavior" on Watson's part in answering questions. Watson's buzzer skills (ringing in when it was ready to answer) were very impressive, but it tended to perform less well when the clues that it had for answering a question were very short.