## More Precious than Gold?
My goal here is to briefly discuss what is meant by the phrase data mining and what mathematical tools and ideas have been brought to bear in trying to make progress in this field. ...
Joseph Malkevitch
## Introduction
Gold has been much admired by human societies at different times and in different places. In 2011 the price of gold (love that data) on some markets exceeded $1900 an ounce. Black gold has been a metaphor for the importance and value of oil (currently over $100 a barrel) for many years now. But could it be that it will be
## Data, Data, Everywhere
What is "data"? Data is a term used in a variety of ways, most commonly as "facts" or statistics in numerical form. However, sometimes the term is used to mean information that is going to be analyzed or used to make some decision. Another relatively recent "definition" of data involves the numbers, symbols, strings and tables (matrices) that are manipulated or analyzed with a computer.
Understanding data, in the sense of seeing patterns that are really present rather than produced by "noise" or chance measurement errors, requires an understanding of probability theory. Probability theory and statistics are in many ways subjects that go hand in hand. However, probability is an intrinsically difficult and subtle subject. While the mathematics of probability theory is now on a firm foundation, mathematical modeling with probability, that is, using probability in subjects outside of mathematics, is very complicated. When one makes the statement that a particular physical coin C has a slight bias towards heads, so that the probability of a head is .501 while the probability of a tail is .499, how does one interpret this statement? If one tosses C over and over again in a way where the physical tossing does not affect whether the coin shows heads or tails (skilled individuals can take a 2-sided coin and, by using the same very precise tossing motion, make heads come up every time), one will get a pattern of H's (for heads) and T's (for tails). For example, I tossed a coin 10 times and got this pattern:
(Photo of Leonard Savage)
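One way to get a feel for the relative-frequency reading of a statement such as "the probability of a head is .501" is a small simulation. The sketch below is only an illustration, not a model of any physical coin: it assumes independent tosses produced by Python's pseudorandom generator, and the function names `toss_coin` and `heads_fraction` are hypothetical names chosen for this example.

```python
import random


def toss_coin(n_tosses, p_heads=0.501, seed=None):
    """Simulate n_tosses independent tosses; return a string of 'H' and 'T'.

    p_heads is the assumed probability of heads (.501 is the figure used
    for the slightly biased coin C in the text).
    """
    rng = random.Random(seed)  # private generator, so runs are reproducible
    return "".join("H" if rng.random() < p_heads else "T"
                   for _ in range(n_tosses))


def heads_fraction(pattern):
    """Relative frequency of heads in a pattern of H's and T's."""
    return pattern.count("H") / len(pattern)


if __name__ == "__main__":
    # A short pattern, like the 10-toss experiment described in the text.
    print(toss_coin(10, seed=1))

    # The observed fraction of heads for increasingly long runs.
    for n in (10, 1_000, 100_000):
        print(n, round(heads_fraction(toss_coin(n, seed=1)), 4))
```

A short run says almost nothing about so slight a bias. The standard error of the observed fraction of heads is roughly 0.5/√n, so it takes on the order of a million tosses before that error shrinks to about .0005, half the size of the bias one is trying to detect.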
Part of the issue is to use common language (whether in English, French, etc.) to express the different environments in which situations come up which involve "noise," "randomness," "chance," or something "unexpected." The nature of radioactive decay has a very different character from where the next tornado might hit in a particular state, or what is the chance that it will hit a particular town on a particular day.
## Pioneers of probability and statistics
It is very rare in mathematics that important ideas come out of nowhere and have no antecedents. Many individuals from many countries have contributed to probability and statistics. Here is a small sample of such important contributors. Certainly one of the early pioneers of the relative frequency approach to probability theory was the French philosopher and mathematician
(Portrait of Blaise Pascal)
A major insight into probability theory was taken by
(Portrait of Thomas Bayes)
Bayes' famous result involves the notion of conditional probability for outcomes that one observes in the world (an experiment) or that result from some hypothetical experiment. Such outcomes are referred to as
(Portrait of Adolphe Quetelet)
In more modern times a growing number of individuals have made contributions to statistics and its interface with probability theory. My purpose here is to show not only that people from many backgrounds and countries have contributed to our current richer understanding of statistics and probability but also how much of what is now known is of quite recent origin. To do proper justice to this topic would require a book-length treatment.
(Karl Pearson)
Like all of the parts of mathematics, when examined "under a microscope," statistics has had a rich and complex history with contributions from many people who thought of themselves as mathematicians but also were in many cases from other areas of intellectual endeavor.
(Photo of John Maynard Keynes)
(Photo of Ronald Fisher)
(Photo of Jerzy Neyman)
(Photo of Bruno de Finetti)
As the power of the digital computer progressed, mathematicians began to take advantage of that power to explore and draw implications from data.
(Photo of John Tukey)
## Data-intensive subjects
It is sometimes claimed that we are now "drowning" in data. This notion has come about in part because so much data is being generated, collected, and stored that most of us don't have the time to look at it all, much less think about its implications. While in some sense all aspects of 21st century American life are becoming data driven, it may be helpful to list a few of the areas that are "data intensive."
## Tools of data mining
The scaffolding that surrounds data mining is the mathematics of statistics, enriched with ideas from artificial intelligence and computational learning theory.
More recently, IBM also designed a computer system called Watson that beat the "best" human opponents at Jeopardy, a game which involves a complex mixture of factual recall in an environment of linguistic playfulness. A contestant gets to choose a topic with a particular number of points, where the difficulty of the question changes with the point value assigned. From time to time the contestant may pick a question which, if answered correctly, will double the amount of money received. Sometimes when a question is chosen in a particular category the contestant gets to pick what part of his/her/its current earnings to "wager" on getting the right answer. There is also a "Final Jeopardy" round where one can try to overtake one's opponent by betting a large part of one's current winnings on giving a correct response. On February 14-16, 2011, in a special series of TV shows screened in prime time, the IBM "system" known as Watson beat two very impressive human opponents, despite a few strange bits of "behavior" on Watson's part in answering questions. Watson's buzzer skills (ringing in when it was ready to answer) were very impressive, but it tended to perform less well when the clues that it had to answer a question were very short.