Sampling :
Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed.
- The motivations for sampling in statistics and data mining are often different : statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming, while data miners sample because it is too expensive or time consuming to process all the data.
- In some cases, using a sampling algorithm can reduce the data size to the point where a better, but more expensive algorithm can be used.
- The key principle of effective sampling is the following : using a sample will work almost as well as using the entire data set if the sample is representative.
- A sample is representative if it has approximately the same property (of interest) as the original set of data.
- If the mean of the data objects is the property of interest, then a sample is representative if it has a mean that is close to that of the original data.
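
The mean-based notion of representativeness above can be illustrated with a minimal sketch. It assumes simple random sampling without replacement, synthetic Gaussian data, and an arbitrary tolerance of 2.0. The names `simple_random_sample` and `is_representative` are hypothetical helpers introduced only for this illustration, not part of any library.

```python
import random
import statistics


def simple_random_sample(data, n, seed=None):
    """Draw n objects from data without replacement (simple random sampling)."""
    rng = random.Random(seed)
    return rng.sample(data, n)


def is_representative(sample, data, tolerance):
    """Judge representativeness with respect to the mean only.

    The tolerance is an assumption chosen by the analyst; the notes do not
    prescribe a specific threshold for "close".
    """
    return abs(statistics.mean(sample) - statistics.mean(data)) <= tolerance


# Synthetic "original data set": 10,000 values around 50 (illustrative only).
rng = random.Random(0)
data = [rng.gauss(50.0, 10.0) for _ in range(10_000)]

# Reduce the data to 500 objects and compare the property of interest.
sample = simple_random_sample(data, 500, seed=1)
print("mean of full data:", statistics.mean(data))
print("mean of sample:   ", statistics.mean(sample))
print("representative (tolerance 2.0):", is_representative(sample, data, 2.0))
```

If a different property is of interest (for example the variance or a class proportion), the check would compare that property instead of the mean.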