PDL-Stats statistics modules in Perl Data Language

PDL::Stats::Basic

  • NAME
  • DESCRIPTION
  • SYNOPSIS
  • FUNCTIONS
  • METHODS
  • SEE ALSO
  • REFERENCES

    NAME

    PDL::Stats::Basic -- basic statistics and related utilities such as standard deviation, Pearson correlation, and t-tests.

    DESCRIPTION

    The terms FUNCTIONS and METHODS are arbitrarily used to refer to methods that are threadable and methods that are NOT threadable, respectively.

    This module does not provide mean or median functions; see SEE ALSO.

    SYNOPSIS

        use PDL::LiteF;
        use PDL::NiceSlice;
        use PDL::Stats::Basic;
    
        my $stdv = $data->stdv;

    or

        my $stdv = stdv( $data );

    FUNCTIONS

    stdv

      Signature: (a(n); float+ [o]b())

    Sample standard deviation.

    stdv does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    stdv_unbiased

      Signature: (a(n); float+ [o]b())

    Unbiased estimate of population standard deviation.

    stdv_unbiased does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    var

      Signature: (a(n); float+ [o]b())

    Sample variance.

    var does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    var_unbiased

      Signature: (a(n); float+ [o]b())

    Unbiased estimate of population variance.

    var_unbiased does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
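
    The difference between the sample and unbiased variants can be illustrated in plain Perl. This is a minimal sketch, assuming the usual convention that the sample statistic divides the sum of squared deviations by n while the unbiased estimate divides by n - 1. It does not call the module, and the data and variable names are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical data, chosen so the arithmetic is easy to follow.
my @x    = (2, 4, 4, 4, 5, 5, 7, 9);
my $n    = @x;
my $mean = sum(@x) / $n;

# Sum of squared deviations from the mean.
my $ss = sum(map { ($_ - $mean) ** 2 } @x);

# Assumed convention: divide by n (sample) or by n - 1 (unbiased).
my $var           = $ss / $n;
my $var_unbiased  = $ss / ($n - 1);
my $stdv          = sqrt $var;
my $stdv_unbiased = sqrt $var_unbiased;

printf "var=%.4f var_unbiased=%.4f stdv=%.4f stdv_unbiased=%.4f\n",
    $var, $var_unbiased, $stdv, $stdv_unbiased;
```

    The two estimates converge as n grows, so the choice matters mostly for small samples.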

    se

      Signature: (a(n); float+ [o]b())

    Standard error of the mean. Useful for calculating confidence intervals.

        # 95% confidence interval for samples with large N
    
        $ci_95_upper = $data->average + 1.96 * $data->se;
        $ci_95_lower = $data->average - 1.96 * $data->se;

    se does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
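
    The confidence interval above can be reproduced by hand in plain Perl. This sketch assumes the standard error is the square root of the unbiased variance divided by n; the data are hypothetical and the module is not called.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical sample; far too small for 1.96 to be appropriate in
# practice, but convenient for checking the arithmetic.
my @x    = (12, 15, 11, 14, 13, 16, 10, 13);
my $n    = @x;
my $mean = sum(@x) / $n;
my $ss   = sum(map { ($_ - $mean) ** 2 } @x);

# Assumed formula: se = sqrt( unbiased variance / n ).
my $se = sqrt( $ss / ($n - 1) / $n );

# Normal-approximation 95% interval around the mean.
my $ci_95_upper = $mean + 1.96 * $se;
my $ci_95_lower = $mean - 1.96 * $se;

printf "mean=%.3f se=%.4f ci=[%.3f, %.3f]\n",
    $mean, $se, $ci_95_lower, $ci_95_upper;
```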

    ss

      Signature: (a(n); float+ [o]b())

    Sum of squared deviations from the mean.

    ss does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    skew

      Signature: (a(n); float+ [o]b())

    Sample skewness, a measure of asymmetry in the data. skewness == 0 for a normal distribution.

    skew does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    skew_unbiased

      Signature: (a(n); float+ [o]b())

    Unbiased estimate of population skewness. This is the number reported by Gnumeric Descriptive Statistics.

    skew_unbiased does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    kurt

      Signature: (a(n); float+ [o]b())

    Sample kurtosis, a measure of the "peakedness" of the data. Reported as excess kurtosis, so kurtosis == 0 for a normal distribution.

    kurt does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    kurt_unbiased

      Signature: (a(n); float+ [o]b())

    Unbiased estimate of population kurtosis. This is the number reported by Gnumeric Descriptive Statistics.

    kurt_unbiased does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
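
    As a rough illustration of the sample versions, the following plain-Perl sketch assumes the moment-based definitions skew = m3 / m2^1.5 and kurt = m4 / m2^2 - 3, where mk is the k-th central moment. The helper central_moment is hypothetical, the data are made up, and the unbiased corrections are not shown.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: k-th central moment of a list (divides by n).
sub central_moment {
    my ($k, @vals) = @_;
    my $mean = sum(@vals) / @vals;
    return sum(map { ($_ - $mean) ** $k } @vals) / @vals;
}

# Made-up data with a long right tail, so the skewness is positive.
my @x = (1, 2, 3, 4, 10);

my $m2 = central_moment(2, @x);
my $m3 = central_moment(3, @x);
my $m4 = central_moment(4, @x);

my $skew = $m3 / $m2 ** 1.5;
my $kurt = $m4 / $m2 ** 2 - 3;    # excess kurtosis: 0 for a normal distribution

printf "skew=%.4f kurt=%.4f\n", $skew, $kurt;
```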

    cov

      Signature: (a(n); b(n); float+ [o]c())

    Sample covariance. See corr for ways to call it.

    cov does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    cov_table

      Signature: (a(n,m); float+ [o]c(m,m))

    Square covariance table. Gives the same result as threading with cov, but calculates only half the square and is therefore much faster. It is also easier to use with higher-dimensional pdls.

    Usage:

        # 5 obs x 3 var, 2 such data tables
    
        perldl> $a = random 5, 3, 2
    
        perldl> p $cov = $a->cov_table
        [
         [
          [ 8.9636438 -1.8624472 -1.2416588]
          [-1.8624472  14.341514 -1.4245366]
          [-1.2416588 -1.4245366  9.8690655]
         ]
         [
          [   10.32644 -0.31311789 -0.95643674]
          [-0.31311789   15.051779  -7.2759577]
          [-0.95643674  -7.2759577   5.4465141]
         ]
        ]
        # diagonal elements of the cov table are the variances
        perldl> p $a->var
        [
         [ 8.9636438  14.341514  9.8690655]
         [  10.32644  15.051779  5.4465141]
        ]

    For the same covariance table using cov:

        perldl> p $a->dummy(2)->cov($a->dummy(1))

    cov_table does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    corr

      Signature: (a(n); b(n); float+ [o]c())

    Pearson correlation coefficient. r = cov(X,Y) / (stdv(X) * stdv(Y)).

    Usage:

        perldl> $a = random 5, 3
        perldl> $b = sequence 5,3
        perldl> p $a->corr($b)
    
        [0.20934208 0.30949881 0.26713007]

    For a square corr table:

        perldl> p $a->corr($a->dummy(1))
    
        [
         [           1  -0.41995259 -0.029301192]
         [ -0.41995259            1  -0.61927619]
         [-0.029301192  -0.61927619            1]
        ]

    However, it is easier and faster to use corr_table.

    corr does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
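
    The formula r = cov(X,Y) / (stdv(X) * stdv(Y)) can be checked by hand with a plain-Perl sketch. It assumes covariance and standard deviation both divide by n (any common divisor cancels in r). The data and the helper mean are made up for illustration; the module is not called.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: arithmetic mean of a list.
sub mean { return sum(@_) / @_ }

# Made-up paired observations.
my @x = (1, 2, 3, 4, 5);
my @y = (2, 4, 5, 4, 5);
my $n = @x;

my ($mean_x, $mean_y) = (mean(@x), mean(@y));

# Covariance and standard deviations, all dividing by n (an assumption).
my $cov    = sum(map { ($x[$_] - $mean_x) * ($y[$_] - $mean_y) } 0 .. $#x) / $n;
my $stdv_x = sqrt( sum(map { ($_ - $mean_x) ** 2 } @x) / $n );
my $stdv_y = sqrt( sum(map { ($_ - $mean_y) ** 2 } @y) / $n );

my $r = $cov / ($stdv_x * $stdv_y);

printf "cov=%.4f r=%.4f\n", $cov, $r;
```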

    corr_table

      Signature: (a(n,m); float+ [o]c(m,m))

    Square Pearson correlation table. Gives the same result as threading with corr, but calculates only half the square and is therefore much faster. It is also easier to use with higher-dimensional pdls.

    Usage:

        # 5 obs x 3 var, 2 such data tables
     
        perldl> $a = random 5, 3, 2
        
        perldl> p $a->corr_table
        [
         [
          [          1 -0.69835951 -0.18549048]
          [-0.69835951           1  0.72481605]
          [-0.18549048  0.72481605           1]
         ]
         [
          [          1  0.82722569 -0.71779883]
          [ 0.82722569           1 -0.63938828]
          [-0.71779883 -0.63938828           1]
         ]
        ]

    For the same result using corr:

        perldl> p $a->dummy(2)->corr($a->dummy(1))

    This is also how to use t_corr and n_pair with such a table.

    corr_table does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    t_corr

      Signature: (r(); n(); [o]t())

        $corr   = $data->corr( $data->dummy(1) );
        $n      = $data->n_pair( $data->dummy(1) );
        $t_corr = $corr->t_corr( $n );
    
        use PDL::GSL::CDF;
    
        $p_2tail = 2 * (1 - gsl_cdf_tdist_P( $t_corr->abs, $n-2 ));

    t significance test for Pearson correlations.

    t_corr does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
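
    The usage above takes n - 2 degrees of freedom, which is consistent with the textbook statistic t = r * sqrt( (n - 2) / (1 - r^2) ). The plain-Perl sketch below assumes that formula; the helper name t_from_r is hypothetical and is not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: t statistic for a Pearson correlation r based on
# n pairs, with n - 2 degrees of freedom. Undefined when |r| == 1.
sub t_from_r {
    my ($r, $n) = @_;
    return $r * sqrt( ($n - 2) / (1 - $r ** 2) );
}

# Made-up example: r = 0.5 observed over 27 pairs.
my $t = t_from_r(0.5, 27);

printf "t=%.4f df=%d\n", $t, 27 - 2;
```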

    n_pair

      Signature: (a(n); b(n); int [o]c())

    Returns the number of good pairs between 2 lists. Useful with corr, especially when bad values are involved.

    n_pair does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    corr_dev

      Signature: (a(n); b(n); float+ [o]c())

        $corr = $a->dev_m->corr_dev($b->dev_m);

    Calculates correlations from dev_m values. Seems faster than computing corr from the original values when the data pdl is big.

    corr_dev does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    t_test

      Signature: (a(n); b(m); float+ [o]t(); [o]d())

        my ($t, $df) = t_test( $pdl1, $pdl2 );
    
        use PDL::GSL::CDF;
    
        my $p_2tail = 2 * (1 - gsl_cdf_tdist_P( $t->abs, $df ));

    Independent sample t-test, assuming equal variances.

    t_test does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
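
    For reference, the textbook equal-variance t-test pools the two sums of squares and uses n1 + n2 - 2 degrees of freedom. The plain-Perl sketch below assumes that formula and the sign convention mean(first) - mean(second); the helpers and data are hypothetical and the module is not called.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helpers: mean and sum of squared deviations of a list.
sub mean { return sum(@_) / @_ }
sub ss {
    my $m = mean(@_);
    return sum(map { ($_ - $m) ** 2 } @_);
}

# Made-up independent samples.
my @g1 = (1, 2, 3, 4, 5);
my @g2 = (4, 5, 6, 7, 8);
my ($n1, $n2) = (scalar @g1, scalar @g2);

# Pooled variance and the resulting t statistic.
my $df     = $n1 + $n2 - 2;
my $pooled = (ss(@g1) + ss(@g2)) / $df;
my $t      = (mean(@g1) - mean(@g2)) / sqrt( $pooled * (1 / $n1 + 1 / $n2) );

printf "t=%.4f df=%d\n", $t, $df;
```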

    t_test_nev

      Signature: (a(n); b(m); float+ [o]t(); [o]d())

    Independent sample t-test, NOT assuming equal variances, i.e. Welch's two-sample t-test. Df follows the Welch-Satterthwaite equation instead of Satterthwaite (1946, as cited by Hays, 1994, 5th ed.). It matches Gnumeric, which matches R.

        my ($t, $df) = $pdl1->t_test_nev( $pdl2 );

    t_test_nev does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
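
    The Welch-Satterthwaite degrees of freedom are generally not an integer. The plain-Perl sketch below shows the standard form of the equation, with q = unbiased variance / n for each group. It assumes the sign convention mean(first) - mean(second); the helpers and data are hypothetical and the module is not called.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helpers: mean and unbiased variance of a list.
sub mean { return sum(@_) / @_ }
sub var_unbiased {
    my $m = mean(@_);
    return sum(map { ($_ - $m) ** 2 } @_) / (@_ - 1);
}

# Made-up samples with clearly different spread and size.
my @g1 = (1, 2, 3, 4, 5);
my @g2 = (2, 4, 6, 8, 10, 12);
my ($n1, $n2) = (scalar @g1, scalar @g2);

# Per-group variance of the mean.
my ($q1, $q2) = (var_unbiased(@g1) / $n1, var_unbiased(@g2) / $n2);

my $t  = (mean(@g1) - mean(@g2)) / sqrt($q1 + $q2);

# Welch-Satterthwaite equation.
my $df = ($q1 + $q2) ** 2
       / ( $q1 ** 2 / ($n1 - 1) + $q2 ** 2 / ($n2 - 1) );

printf "t=%.4f df=%.4f\n", $t, $df;
```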

    t_test_paired

      Signature: (a(n); b(n); float+ [o]t(); [o]d())

    Paired sample t-test.

    t_test_paired does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.
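
    A paired test reduces to a one-sample test on the pairwise differences. The plain-Perl sketch below assumes the textbook form t = mean(d) / sqrt( var_unbiased(d) / n ) with n - 1 degrees of freedom; the data and variable names are made up and the module is not called.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Made-up paired measurements on the same five subjects.
my @before = (5, 6, 7, 8, 9);
my @after  = (4, 4, 6, 5, 6);

# Pairwise differences.
my @d = map { $before[$_] - $after[$_] } 0 .. $#before;
my $n = @d;

my $mean_d = sum(@d) / $n;
my $var_d  = sum(map { ($_ - $mean_d) ** 2 } @d) / ($n - 1);

my $t  = $mean_d / sqrt( $var_d / $n );
my $df = $n - 1;

printf "t=%.4f df=%d\n", $t, $df;
```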

    binomial_test

      Signature: (x(); n(); p_expected(); [o]p())

    Binomial test. One-tailed significance test for two-outcome distribution. Given the number of successes, the number of trials, and the expected probability of success, returns the probability of getting this many or more successes.

    This function does NOT currently support bad values in the number of successes.

    Usage:

      # assume a fair coin, i.e. 0.5 probability of getting heads
      # test whether getting 8 heads out of 10 coin flips is unusual
    
      my $p = binomial_test( 8, 10, 0.5 );  # 0.0107421875. Yes it is unusual.

    METHODS

    rtable

    Reads either a file or a file handle*. Returns an observation x variable pdl, plus var and obs ids if specified. Ids are returned as Perl array refs to allow for non-numeric ids. Other non-numeric entries are treated as missing: they are filled with $opt{MISSN} and then set to BAD**. The number of data rows to read from the top can be specified, but not an arbitrary range.

    *If passed a handle, it will not be closed here.

    **PDL::Bad::setvaltobad only works consistently with the default TYPE double before PDL-2.4.4_04.

    Default options (case insensitive):

        V       => 1,        # verbose. prints simple status
        TYPE    => double,
        C_ID    => 1,        # boolean. file has col id.
        R_ID    => 1,        # boolean. file has row id.
        R_VAR   => 0,        # boolean. set to 1 if var in rows
        SEP     => "\t",     # can take regex qr//
        MISSN   => -999,     # this value treated as missing and set to BAD
        NROW    => '',       # set to read specified num of data rows

    Usage:

    Sample file diet.txt:

        uid	height	weight	diet
        akw	72	320	1
        bcm	68	268	1
        clq	67	180	2
        dwm	70	200	2
      
        ($data, $idv, $ido) = rtable 'diet.txt';
    
        # By default prints out data info and @$idv index and element
    
        reading diet.txt for data and id... OK.
        data table as PDL dim o x v: PDL: Double D [4,3]
        0	height
        1	weight
        2	diet

    Another way of using it,

        $data = rtable( \*STDIN, {TYPE=>long} );
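
    To make the parsing rules concrete, here is a plain-Perl sketch of the same idea: the first line supplies the variable ids, the first column supplies the observation ids, and non-numeric entries are treated as missing. It is not the module's implementation. The in-memory data (including the made-up NA entry) are hypothetical, and undef stands in for BAD.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# In-memory stand-in for a tab-separated file; NA is a made-up missing entry.
my $text = join '', map { join("\t", @$_) . "\n" } (
    [qw(uid height weight diet)],
    [qw(akw 72 320 1)],
    [qw(bcm 68 NA 1)],
    [qw(clq 67 180 2)],
);

open my $fh, '<', \$text or die "cannot open in-memory file: $!";

# First line holds the variable ids; drop the header of the id column.
chomp(my $header = <$fh>);
my (undef, @idv) = split /\t/, $header;

my (@ido, @rows);
while (my $line = <$fh>) {
    chomp $line;
    my ($id, @vals) = split /\t/, $line;
    push @ido, $id;
    # Non-numeric entries become undef, standing in for BAD.
    push @rows, [ map { looks_like_number($_) ? $_ : undef } @vals ];
}
close $fh;

print "variables: @idv\n";
print "observations: @ido\n";
```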

    group_by

    Returns a pdl reshaped according to the specified factor variable. Most useful when used in conjunction with other threading calculations such as average, stdv, etc. When the factor variable contains unequal numbers of cases in each level, the returned pdl is padded with bad values to fit the level with the most cases. This allows the subsequent calculation (average, stdv, etc.) to return the correct results for each level.

    Usage:

        # simple case with 1d pdl and equal number of n in each level of the factor
    
        pdl> p $a = sequence 10
        [0 1 2 3 4 5 6 7 8 9]
    
        pdl> p $factor = $a > 4
        [0 0 0 0 0 1 1 1 1 1]
    
        pdl> p $a->group_by( $factor )->average
        [2 7]
    
        # more complex case with threading and unequal number of n across levels in the factor
    
        pdl> p $a = sequence 10,2
        [
         [ 0  1  2  3  4  5  6  7  8  9]
         [10 11 12 13 14 15 16 17 18 19]
        ]
    
        pdl> p $factor = qsort $a( ,0) % 3
        [
         [0 0 0 0 1 1 1 2 2 2]
        ]
    
        pdl> p $a->group_by( $factor )
        [
         [
          [ 0  1  2  3]
          [10 11 12 13]
         ]
         [
          [  4   5   6 BAD]
          [ 14  15  16 BAD]
         ]
         [
          [  7   8   9 BAD]
          [ 17  18  19 BAD]
         ]
        ]
         ARRAY(0xa2a4e40)
    
        # group_by supports perl factors, multiple factors
        # returns factor labels in addition to pdl in array context
    
        pdl> p $a = sequence 12
        [0 1 2 3 4 5 6 7 8 9 10 11]
    
        pdl> $odd_even = [qw( e o e o e o e o e o e o )]
    
        pdl> $magnitude = [qw( l l l l l l h h h h h h )]
    
        pdl> ($a_grouped, $label) = $a->group_by( $odd_even, $magnitude )
    
        pdl> p $a_grouped
        [
         [
          [0 2 4]
          [1 3 5]
         ]
         [
          [ 6  8 10]
          [ 7  9 11]
         ]
        ]
    
        pdl> p Dumper $label
        $VAR1 = [
                  [
                    'e_l',
                    'o_l'
                  ],
                  [
                    'e_h',
                    'o_h'
                  ]
                ];
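
    The padding behaviour can be mimicked in plain Perl to see why per-level statistics stay correct. In this sketch undef stands in for BAD and is skipped when averaging. The data mirror the unequal-n factor shown earlier, the variable names are made up, and this is not how the module itself is implemented.

```perl
use strict;
use warnings;
use List::Util qw(sum max);

my @values = (0 .. 9);
my @factor = (0, 0, 0, 0, 1, 1, 1, 2, 2, 2);

# Collect the values belonging to each factor level.
my %group;
push @{ $group{ $factor[$_] } }, $values[$_] for 0 .. $#values;

# Pad shorter levels with undef so that every row has the same length.
my $width  = max(map { scalar @$_ } values %group);
my @levels = sort { $a <=> $b } keys %group;
my @table  = map { [ @{ $group{$_} }, (undef) x ($width - @{ $group{$_} }) ] } @levels;

# Average each level, skipping the padding.
my @avg = map { my @good = grep { defined } @$_; sum(@good) / @good } @table;

print "level $levels[$_]: average $avg[$_]\n" for 0 .. $#levels;
```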

    which_id

    Looks up the specified var (obs) ids in $idv ($ido) (see rtable) and returns their indices in $idv ($ido) as a pdl if found. The indices are ordered by the specified subset. Useful for selecting data by var (obs) id.

        my $ind = which_id $ido, ['smith', 'summers', 'tesla'];
    
        my $data_subset = $data( $ind, );
    
        # take advantage of perl pattern matching
        # e.g. use data from people whose last name starts with s
    
        my $i = which_id $ido, [ grep { /^s/ } @$ido ];
    
        my $data_s = $data($i, );
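
    The lookup itself amounts to mapping each requested id to its position in the id list while preserving the order of the request. A plain-Perl sketch follows, with a hypothetical helper lookup_ids that returns a Perl list rather than a pdl and is assumed to skip ids that are not found.

```perl
use strict;
use warnings;

# Made-up observation ids, in the order they appear in the data.
my @ido = qw(tesla smith jones summers);

# Map each id to its position.
my %pos;
@pos{@ido} = 0 .. $#ido;

# Hypothetical helper: indices of the requested ids, in request order.
# Ids that are not present are skipped (an assumption).
sub lookup_ids {
    my ($wanted) = @_;
    return map { $pos{$_} } grep { exists $pos{$_} } @$wanted;
}

my @ind = lookup_ids([qw(smith summers tesla nobody)]);

print "@ind\n";    # prints "1 3 0"
```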

    SEE ALSO

    PDL::Basic (hist for frequency counts)

    PDL::Ufunc (sum, avg, median, min, max, etc.)

    PDL::GSL::CDF (various cumulative distribution functions)

    REFERENCES

    Hays, W.L. (1994). Statistics (5th ed.). Fort Worth, TX: Harcourt Brace College Publishers.