math::statistics(n) | Tcl Math Library | math::statistics(n) |
math::statistics - Basic statistical functions and procedures
package require Tcl 8
package require math::statistics 0.5
::math::statistics::mean data
::math::statistics::min data
::math::statistics::max data
::math::statistics::number data
::math::statistics::stdev data
::math::statistics::var data
::math::statistics::pstdev data
::math::statistics::pvar data
::math::statistics::median data
::math::statistics::basic-stats data
::math::statistics::histogram limits values
::math::statistics::corr data1 data2
::math::statistics::interval-mean-stdev data confidence
::math::statistics::t-test-mean data est_mean est_stdev confidence
::math::statistics::test-normal data confidence
::math::statistics::lillieforsFit data
::math::statistics::quantiles data confidence
::math::statistics::quantiles limits counts confidence
::math::statistics::autocorr data
::math::statistics::crosscorr data1 data2
::math::statistics::mean-histogram-limits mean stdev number
::math::statistics::minmax-histogram-limits min max number
::math::statistics::linear-model xdata ydata intercept
::math::statistics::linear-residuals xdata ydata intercept
::math::statistics::test-2x2 n11 n21 n12 n22
::math::statistics::print-2x2 n11 n21 n12 n22
::math::statistics::control-xbar data ?nsamples?
::math::statistics::control-Rchart data ?nsamples?
::math::statistics::test-xbar control data
::math::statistics::test-Rchart control data
::math::statistics::tstat dof ?alpha?
::math::statistics::mv-wls wt1 weights_and_values
::math::statistics::mv-ols values
::math::statistics::pdf-normal mean stdev value
::math::statistics::pdf-exponential mean value
::math::statistics::pdf-uniform xmin xmax value
::math::statistics::pdf-gamma alpha beta value
::math::statistics::pdf-poisson mu k
::math::statistics::pdf-chisquare df value
::math::statistics::pdf-student-t df value
::math::statistics::pdf-beta a b value
::math::statistics::cdf-normal mean stdev value
::math::statistics::cdf-exponential mean value
::math::statistics::cdf-uniform xmin xmax value
::math::statistics::cdf-students-t degrees value
::math::statistics::cdf-gamma alpha beta value
::math::statistics::cdf-poisson mu k
::math::statistics::cdf-beta a b value
::math::statistics::random-normal mean stdev number
::math::statistics::random-exponential mean number
::math::statistics::random-uniform xmin xmax number
::math::statistics::random-gamma alpha beta number
::math::statistics::random-chisquare df number
::math::statistics::random-student-t df number
::math::statistics::random-beta a b number
::math::statistics::histogram-uniform xmin xmax limits number
::math::statistics::incompleteGamma x p ?tol?
::math::statistics::incompleteBeta a b x ?tol?
::math::statistics::filter varname data expression
::math::statistics::map varname data expression
::math::statistics::samplescount varname list expression
::math::statistics::subdivide
::math::statistics::plot-scale canvas xmin xmax ymin ymax
::math::statistics::plot-xydata canvas xdata ydata tag
::math::statistics::plot-xyline canvas xdata ydata tag
::math::statistics::plot-tdata canvas tdata tag
::math::statistics::plot-tline canvas tdata tag
::math::statistics::plot-histogram canvas counts limits tag
The math::statistics package contains functions and procedures for basic statistical data analysis, such as:
It is meant to help in developing data analysis applications or doing ad hoc data analysis. It is not in itself a full application, nor is it intended to rival full (non-)commercial statistical packages.
The purpose of this document is to describe the implemented procedures and provide some examples of their usage. As there is ample literature on the algorithms involved, we refer to relevant textbooks for more explanations. The package contains a fairly large number of public procedures. They can be divided into four sets: general procedures, procedures that deal with specific statistical distributions, list procedures to select or transform data, and simple plotting procedures (these require Tk). Note: The data that need to be analyzed are always contained in a simple list. Missing values are represented as empty list elements.
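The list convention described above can be sketched as follows. This is a minimal illustration with made-up data, assuming the package is installed as part of Tcllib; the empty element stands for a missing observation, so the statistics should cover only the four remaining values.

```tcl
package require math::statistics

# Illustrative data; the empty element {} marks a missing value
set data {1.0 2.0 {} 3.0 4.0}

set mean   [::math::statistics::mean   $data]
set number [::math::statistics::number $data]
set stdev  [::math::statistics::stdev  $data]

puts "mean   = $mean"
puts "number = $number"
puts "stdev  = $stdev"
```

The same list can be passed to basic-stats to obtain all of these values in one call.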
The general statistical procedures are:
(This routine is called whenever either or all of the basic statistical parameters are required. Hence all calculations are done and the relevant values are returned.)
The correlation is determined in such a way that the first value is always 1 and all others are equal to or smaller than 1. The number of values involved will diminish as the "time" (the index in the list of returned values) increases.
The correlation is determined in such a way that the values can never exceed 1 in magnitude. The number of values involved will diminish as the "time" (the index in the list of returned values) increases.
Convenience function - the result is suitable for the histogram function.
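The limits passed to the histogram procedure are a plain list, whether they come from one of the convenience functions or are written by hand. The sketch below uses hand-written limits and made-up values, none of which lies exactly on a limit, so it does not depend on how boundary values are assigned. It assumes, as the print-histogram helper in the example further on does, that the result holds one count per limit plus a final count for values beyond the last limit.

```tcl
package require math::statistics

# Hand-written limits and made-up values (none lies exactly on a limit)
set limits {2 4 6}
set values {1 3 3 5 7 9}

# One count per limit, plus a final count for values beyond the last limit
set counts [::math::statistics::histogram $limits $values]

foreach limit $limits count [lrange $counts 0 end-1] {
    puts "interval ending at $limit: $count"
}
puts "beyond [lindex $limits end]: [lindex $counts end]"
```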
The result consists of the following list:
Returns a list of the differences between the actual data and the predicted values.
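A short sketch of how the two linear regression procedures might be used together. The data are made up, and the positions of the intercept and slope in the result list (the first two elements) are an assumption that is not spelled out in the text above; check them against your installed version.

```tcl
package require math::statistics

# Made-up data, roughly y = 1 + 2*x with some scatter
set xdata {1.0 2.0 3.0 4.0}
set ydata {3.1 4.9 7.2 8.8}

# Third argument 1: fit with an intercept
set model     [::math::statistics::linear-model     $xdata $ydata 1]
set residuals [::math::statistics::linear-residuals $xdata $ydata 1]

# Assumption: intercept and slope are the first two elements of the result
set intercept [lindex $model 0]
set slope     [lindex $model 1]

puts [format "fit: y = %.3f + %.3f * x" $intercept $slope]
puts "residuals: $residuals"
```

For a least-squares fit with an intercept, the residuals should sum to approximately zero, which is a quick sanity check on the result.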
Returns the "chi-square" value, which can be used to determine the significance.
Returns a short report, useful in an interactive session.
Returns the mean, the lower limit, the upper limit and the number of data per subsample.
Returns the mean range, the lower limit, the upper limit and the number of data per subsample.
Returns a list of subsamples (their indices) that indeed violate the limits.
Returns a list of subsamples (their indices) that indeed violate the limits.
Besides the linear regression with a single independent variable, the statistics package provides two procedures for doing ordinary least squares (OLS) and weighted least squares (WLS) linear regression with several variables. They were written by Eric Kemp-Benedict.
In addition to these two, it provides a procedure (tstat) for calculating the value of the t-statistic for the specified number of degrees of freedom that is required to demonstrate a given level of significance.
Note: These procedures depend on the math::linearalgebra package.
Description of the procedures
The value t* returned by tstat satisfies
    P(t*)  =  1 - alpha/2
    P(-t*) =  alpha/2
for the number of degrees of freedom dof.
Given a sample of normally-distributed data x, with an estimate xbar for the mean and sbar for the standard deviation, the alpha confidence interval for the estimate of the mean can be calculated as
    ( xbar - t* sbar , xbar + t* sbar)
The return values from this procedure can be compared to an estimated t-statistic to determine whether the estimated value of a parameter is significantly different from zero at the given confidence level.
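The interval formula above can be sketched as follows. The estimates xbar and sbar are made-up numbers chosen for illustration, not the result of any real analysis; only the call to tstat comes from the package.

```tcl
package require math::statistics

# Made-up estimates for illustration
set xbar  10.0   ;# estimated mean
set sbar  0.5    ;# estimated standard deviation, as used in the formula
set dof   9      ;# number of degrees of freedom
set alpha 0.05   ;# significance level

set tstar [::math::statistics::tstat $dof $alpha]

# ( xbar - t* sbar , xbar + t* sbar )
set lower [expr {$xbar - $tstar * $sbar}]
set upper [expr {$xbar + $tstar * $sbar}]

puts [format "t* = %.3f, interval = (%.3f, %.3f)" $tstar $lower $upper]
```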
The linear model is of the form
    y = b0 + b1 * x1 + b2 * x2 + ... + bN * xN + error
and each point satisfies
    yi = b0 + b1 * xi1 + b2 * xi2 + ... + bN * xiN + Residual_i
The procedure returns a list with the following elements:
This procedure simply calls ::mvlinreg::wls with the weights set to 1.0, and returns the same information.
Example of the use:
    package require math::statistics

    # Store the Unicode "+/-" character
    set pm "\u00B1"

    # Provide some data
    set data {{  -.67  14.18  60.03  -7.5  }
              { 36.97  15.52  34.24  14.61 }
              {-29.57  21.85  83.36  -7.   }
              {-16.9   11.79  51.67  -6.56 }
              { 14.09  16.24  36.97 -12.84 }
              { 31.52  20.93  45.99 -25.4  }
              { 24.05  20.69  50.27  17.27 }
              { 22.23  16.91  45.07  -4.3  }
              { 40.79  20.49  38.92   -.73 }
              {-10.35  17.24  58.77  18.78 }}

    # Call the ols routine
    set results [::math::statistics::mv-ols $data]

    # Pretty-print the results
    puts "R-squared: [lindex $results 0]"
    puts "Adj R-squared: [lindex $results 1]"
    puts "Coefficients $pm s.e. -- \[95% confidence interval\]:"
    foreach val [lindex $results 2] se [lindex $results 3] bounds [lindex $results 4] {
        set lb [lindex $bounds 0]
        set ub [lindex $bounds 1]
        puts "   $val $pm $se -- \[$lb to $ub\]"
    }
In the literature a large number of probability distributions can be found. The statistics package supports:
In principle for each distribution one has procedures for:
The following procedures have been implemented:
\[ P(p,x) = \frac{1}{\Gamma(p)} \int_0^x e^{-t}\, t^{p-1}\, dt \]
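As a quick check of the definition above: for p = 1 the integral can be done by hand and gives P(1,x) = 1 - exp(-x). The sketch below compares that closed form with the value from incompleteGamma. The value of x is arbitrary. Note that the procedure takes x first and p second, the reverse of the order in P(p,x).

```tcl
package require math::statistics

set x 2.0
set p 1.0

# Note the argument order: x first, then p
set computed [::math::statistics::incompleteGamma $x $p]

# For p = 1 the integral has the closed form 1 - exp(-x)
set expected [expr {1.0 - exp(-$x)}]

puts [format "incompleteGamma: %.6f, closed form: %.6f" $computed $expected]
```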
TO DO: more function descriptions to be added
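To illustrate the naming scheme of the distribution procedures listed in the synopsis, here is a minimal sketch that evaluates a few densities and cumulative probabilities and draws a small random sample. The parameter values are arbitrary. The comments give the values expected from the textbook definitions of these distributions, not output copied from a run.

```tcl
package require math::statistics

# Standard normal: density at the mean, about 1/sqrt(2*pi)
set pdfn [::math::statistics::pdf-normal 0.0 1.0 0.0]

# Standard normal: cumulative probability at 1.96, about 0.975
set cdfn [::math::statistics::cdf-normal 0.0 1.0 1.96]

# Uniform on [0,10]: a quarter of the probability lies below 2.5
set cdfu [::math::statistics::cdf-uniform 0.0 10.0 2.5]

# Exponential with mean 2: probability of a value below the mean, 1 - exp(-1)
set cdfe [::math::statistics::cdf-exponential 2.0 2.0]

# Five random numbers from a normal distribution with mean 10 and stdev 2
set sample [::math::statistics::random-normal 10.0 2.0 5]

puts "pdf-normal:      $pdfn"
puts "cdf-normal:      $cdfn"
puts "cdf-uniform:     $cdfu"
puts "cdf-exponential: $cdfe"
puts "random-normal:   $sample"
```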
The data manipulation procedures act on lists or lists of lists:
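A minimal sketch of the filter and map procedures from the synopsis, using made-up data. The first argument names the variable that holds each list element in turn while the expression is evaluated; the name x is an arbitrary choice.

```tcl
package require math::statistics

set data {1 2 3 4 5 6}

# Keep only the values for which the expression is true
set large [::math::statistics::filter x $data {$x > 3}]

# Transform every value with the expression
set doubled [::math::statistics::map x $data {$x * 2}]

puts "filtered: $large"
puts "mapped:   $doubled"
```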
The following simple plotting procedures are available:
The following procedures are yet to be implemented:
The code below is a small example of how you can examine a set of data:
    # Simple example:
    # - Generate data (as a cheap way of getting some)
    # - Perform statistical analysis to describe the data
    #
    package require math::statistics

    #
    # Two auxiliary procs
    #
    proc pause {time} {
        # vwait resolves the variable name at global scope
        set ::wait 0
        after [expr {$time*1000}] {set ::wait 1}
        vwait ::wait
    }

    proc print-histogram {counts limits} {
        foreach count $counts limit $limits {
            if { $limit != {} } {
                puts [format "<%12.4g\t%d" $limit $count]
                set prev_limit $limit
            } else {
                puts [format ">%12.4g\t%d" $prev_limit $count]
            }
        }
    }

    #
    # Our source of arbitrary data
    #
    proc generateData { data1 data2 } {
        upvar 1 $data1 _data1
        upvar 1 $data2 _data2
        set d1 0.0
        set d2 0.0
        for { set i 0 } { $i < 100 } { incr i } {
            set d1 [expr {10.0-2.0*cos(2.0*3.1415926*$i/24.0)+3.5*rand()}]
            set d2 [expr {0.7*$d2+0.3*$d1+0.7*rand()}]
            lappend _data1 $d1
            lappend _data2 $d2
        }
        return {}
    }

    #
    # The analysis session
    #
    package require Tk
    console show
    canvas .plot1
    canvas .plot2
    pack .plot1 .plot2 -fill both -side top

    generateData data1 data2

    puts "Basic statistics:"
    set b1 [::math::statistics::basic-stats $data1]
    set b2 [::math::statistics::basic-stats $data2]
    foreach label {mean min max number stdev var} v1 $b1 v2 $b2 {
        puts "$label\t$v1\t$v2"
    }

    puts "Plot the data as function of \"time\" and against each other"
    ::math::statistics::plot-scale .plot1 0 100 0 20
    ::math::statistics::plot-scale .plot2 0 20 0 20
    ::math::statistics::plot-tline .plot1 $data1
    ::math::statistics::plot-tline .plot1 $data2
    ::math::statistics::plot-xydata .plot2 $data1 $data2

    puts "Correlation coefficient:"
    puts [::math::statistics::corr $data1 $data2]

    pause 2
    puts "Plot histograms"
    ::math::statistics::plot-scale .plot2 0 20 0 100
    set limits [::math::statistics::minmax-histogram-limits 7 16]
    set histogram_data [::math::statistics::histogram $limits $data1]
    ::math::statistics::plot-histogram .plot2 $histogram_data $limits

    puts "First series:"
    print-histogram $histogram_data $limits

    pause 2
    set limits [::math::statistics::minmax-histogram-limits 0 15 10]
    set histogram_data [::math::statistics::histogram $limits $data2]
    ::math::statistics::plot-histogram .plot2 $histogram_data $limits d2

    puts "Second series:"
    print-histogram $histogram_data $limits

    puts "Autocorrelation function:"
    set autoc [::math::statistics::autocorr $data1]
    puts [::math::statistics::map x $autoc {[format "%.2f" $x]}]

    puts "Cross-correlation function:"
    set crossc [::math::statistics::crosscorr $data1 $data2]
    puts [::math::statistics::map x $crossc {[format "%.2f" $x]}]

    ::math::statistics::plot-scale .plot1 0 100 -1 4
    ::math::statistics::plot-tline .plot1 $autoc "autoc"
    ::math::statistics::plot-tline .plot1 $crossc "crossc"

    puts "Quantiles: 0.1, 0.2, 0.5, 0.8, 0.9"
    puts "First: [::math::statistics::quantiles $data1 {0.1 0.2 0.5 0.8 0.9}]"
    puts "Second: [::math::statistics::quantiles $data2 {0.1 0.2 0.5 0.8 0.9}]"

If you run this example, then the following should be clear:
This document, and the package it describes, will undoubtedly contain bugs and other problems. Please report such in the category math :: statistics of the Tcllib SF Trackers [http://sourceforge.net/tracker/?group_id=12883]. Please also report any ideas for enhancements you may have for either package and/or documentation.
data analysis, mathematics, statistics
Mathematics
0.5 | math |