PDL-Stats: statistics modules in Perl Data Language

This is a collection of statistics modules in Perl Data Language, with a quick-start guide for non-PDL people.

They make perldl, the simple shell for PDL, work like a teenie weenie R, but with PDL threading ("the fast (and automagic) vectorised iteration of 'elementary operations' over arbitrary slices of multidimensional data") applied to procedures including t-test, ordinary least squares regression, and k-means clustering.

Of course, they also work in perl scripts, which makes the package an ideal tool for statistical natural language processing--you get all the text processing power of perl as well as the fast number crunching capabilities of a data language.

Documentation (PODs)

PDL::Stats

Loads the modules named below. The POD includes a quick-start guide for non-PDL people.

PDL::Stats::Basic

Basic statistics and related utilities (standard deviation, variance, correlation, t-test, etc.).

PDL::Stats::Distr

Parameter estimation and probability density functions for distributions.

PDL::Stats::GLM

General linear modeling methods (multiple linear regression; factorial, repeated measures, and mixed model ANOVA; etc.) and logistic regression.

PDL::Stats::Kmeans

Classic k-means cluster analysis.

PDL::Stats::TS

Basic time series analysis functions.

PDL::GSL::CDF

PDL interface to GSL Cumulative Distribution Functions.

Dependencies

PDL

Perl Data Language, preferably installed with a Fortran compiler. A few methods (logistic regression and all plotting methods) work only if PDL was installed with a Fortran compiler, and some methods (ordinary least squares regression and PCA) run much faster with one.

The recommended PDL version is 2.4.8. PDL-2.4.7 introduced a bug in lu_decomp() which caused a few functions in PDL::Stats::GLM to fail. Otherwise the minimum compatible PDL version is 2.4.4.
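To see which PDL version is currently installed, a check along the following lines may help. This is only a minimal sketch: the function name `pdl_version` is made up for illustration, and it assumes `perl` is on the PATH and relies on the `$PDL::VERSION` variable set by the PDL module.

```shell
# Hypothetical helper: print the installed PDL version,
# or a short note if perl or PDL cannot be loaded.
pdl_version() {
    perl -MPDL -e 'print "$PDL::VERSION\n"' 2>/dev/null || echo "PDL not installed"
}

pdl_version
```

The printed version can then be compared by eye against the versions mentioned above.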

GSL (Optional)

GNU Scientific Library. This is required by PDL::Stats::Distr and PDL::GSL::CDF, the latter of which provides p-values for PDL::Stats::GLM. GSL is NOT required for the core PDL::Stats modules, i.e., Basic, GLM, and Kmeans.

PGPLOT (Optional)

PDL-Stats currently uses PGPLOT for plotting. Three similarly named pgplot/PGPLOT components are involved, which has led to much confusion during installation. First there is pgplot, the Fortran library. Then there is PGPLOT, the perl module that is the perl interface to pgplot. Finally there is PDL::Graphics::PGPLOT, which depends on both pgplot and PGPLOT and is what PDL-Stats uses for plotting.
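When plotting does not work, it can help to find out which of the perl-side layers is missing. The following is a rough sketch rather than an official diagnostic: the function name `check_plot_modules` is hypothetical, and it only tests whether each perl module can be loaded. The pgplot Fortran library is not tested directly, although a PGPLOT module that loads suggests the library was found when that module was built.

```shell
# Hypothetical helper: report whether each perl plotting layer can be loaded.
check_plot_modules() {
    for mod in PGPLOT PDL::Graphics::PGPLOT; do
        # "perl -MModule -e 1" exits non-zero if the module fails to load
        if perl -M"$mod" -e 1 >/dev/null 2>&1; then
            echo "$mod: ok"
        else
            echo "$mod: missing"
        fi
    done
}

check_plot_modules
```

If PGPLOT is reported missing, PDL::Graphics::PGPLOT will not be usable for plotting either, so the lower layer is the one to fix first.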

Installation

If you are using Debian Wheezy or Ubuntu 12.10 and above, you can install the package with

    sudo apt-get install libpdl-stats-perl

Or, you can use cpan

    sudo cpan PDL::Stats

You can also follow the standard perl module installation method in a *nix environment and build it from source:

    tar xvf PDL-Stats-xxx.tar.gz
    cd PDL-Stats-xxx

    perl Makefile.PL
    make
    make test
    sudo make install

If you have got PDL (mostly) installed, this should be trivial. If you have trouble installing PDL, you can look for help at the PDL wiki or the PDL mailing list.

Thanks to Sisyphus, Windows users can download and install the ppm version of PDL-Stats and all dependencies using the PPM utility included in ActiveState perl or Strawberry perl. You can also get the PPM utility from CPAN.

    ppm install http://www.sisyphusion.tk/ppm/PGPLOT.ppd
    ppm install http://www.sisyphusion.tk/ppm/PDL.ppd
    ppm install http://www.sisyphusion.tk/ppm/PDL-Stats.ppd