PDL-Stats statistics modules in Perl Data Language

PDL::Stats::GLM

  • NAME
  • DESCRIPTION
  • SYNOPSIS
  • FUNCTIONS
  • METHODS
  • SEE ALSO
  • REFERENCES

    NAME

    PDL::Stats::GLM -- general and generalized linear modeling methods such as ANOVA, linear regression, PCA, and logistic regression.

    DESCRIPTION

    The terms FUNCTIONS and METHODS are arbitrarily used to refer to methods that are threadable and methods that are NOT threadable, respectively. FUNCTIONS except ols_t support bad value. PDL::Slatec strongly recommended for most METHODS, and it is required for logistic.

    P-values, where appropriate, are provided if PDL::GSL::CDF is installed.

    SYNOPSIS

        use PDL::LiteF;
        use PDL::NiceSlice;
        use PDL::Stats::GLM;
    
        # do a multiple linear regression and plot the residuals
    
        my $y = pdl( 8, 7, 7, 0, 2, 5, 0 );
    
        my $x = pdl( [ 0, 1, 2, 3, 4, 5, 6 ],        # linear component
                     [ 0, 1, 4, 9, 16, 25, 36 ] );   # quadratic component
    
        my %m  = $y->ols( $x, {plot=>1} );
    
        print "$_\t$m{$_}\n" for (sort keys %m);

    FUNCTIONS

    fill_m

      Signature: (a(n); float+ [o]b(n))

    Replaces bad values with sample mean. Mean is set to 0 if all obs are bad. Can be done inplace.

         perldl> p $data
         [
          [  5 BAD   2 BAD]
          [  7   3   7 BAD]
         ]
    
         perldl> p $data->fill_m
         [
          [      5     3.5       2     3.5]
          [      7       3       7 5.66667]
         ]

    The output pdl badflag is cleared.

    fill_rand

      Signature: (a(n); [o]b(n))

    Replaces bad values with random sample (with replacement) of good observations from the same variable. Can be done inplace.

        perldl> p $data
        [
         [  5 BAD   2 BAD]
         [  7   3   7 BAD]
        ]
        
        perldl> p $data->fill_rand
        
        [
         [5 2 2 5]
         [7 3 7 7]
        ]

    The output pdl badflag is cleared.

    dev_m

      Signature: (a(n); float+ [o]b(n))

    Replaces values with deviations from the mean. Can be done inplace.

    dev_m does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    stddz

      Signature: (a(n); float+ [o]b(n))

    Standardize ie replace values with z_scores based on sample standard deviation from the mean (replace with 0s if stdv==0). Can be done inplace.

    stddz does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    sse

      Signature: (a(n); b(n); float+ [o]c())

    Sum of squared errors between actual and predicted values.

    sse does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    mse

      Signature: (a(n); b(n); float+ [o]c())

    Mean of squared errors between actual and predicted values, ie variance around predicted value.

    mse does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    rmse

      Signature: (a(n); b(n); float+ [o]c())

    Root mean squared error, ie stdv around predicted value.

    rmse does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    pred_logistic

      Signature: (a(n,m); b(m); float+ [o]c(n))

    Calculates predicted prob value for logistic regression.

        # glue constant then apply coeff returned by the logistic method
    
        $pred = $x->glue(1,ones($x->dim(0)))->pred_logistic( $m{b} );

    pred_logistic does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    d0

      Signature: (a(n); float+ [o]c())
        my $d0 = $y->d0();

    Null deviance for logistic regression.

    d0 does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    dm

      Signature: (a(n); b(n); float+ [o]c())
        my $dm = $y->dm( $y_pred );
    
          # null deviance
        my $d0 = $y->dm( ones($y->nelem) * $y->avg );

    Model deviance for logistic regression.

    dm does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    dvrs

      Signature: (a(); b(); float+ [o]c())

    Deviance residual for logistic regression.

    dvrs does handle bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

    ols_t

    Threaded version of ordinary least squares regression (ols). The price of threading was losing significance tests for coefficients (but see r2_change). The fitting function was shamelessly copied then modified from PDL::Fit::Linfit. Uses PDL::Slatec when possible but otherwise uses PDL::MatrixOps. Intercept is LAST of coeff if CONST => 1.

    ols_t does not handle bad values. consider fill_m or fill_rand if there are bad values.

    Default options (case insensitive):

        CONST   => 1,

    Usage:

        # DV, 2 person's ratings for top-10 box office movies
        # ascending sorted by box office numbers
    
        perldl> p $y = qsort ceil( random(10, 2)*5 )    
        [
         [1 1 2 4 4 4 4 5 5 5]
         [1 2 2 2 3 3 3 3 5 5]
        ]
    
        # model with 2 IVs, a linear and a quadratic trend component
    
        perldl> $x = cat sequence(10), sequence(10)**2
    
        # suppose our novice modeler thinks this creates 3 different models
        # for predicting movie ratings
    
        perldl> p $x = cat $x, $x * 2, $x * 3
        [
         [
          [ 0  1  2  3  4  5  6  7  8  9]
          [ 0  1  4  9 16 25 36 49 64 81]
         ]
         [
          [  0   2   4   6   8  10  12  14  16  18]
          [  0   2   8  18  32  50  72  98 128 162]
         ]
         [
          [  0   3   6   9  12  15  18  21  24  27]
          [  0   3  12  27  48  75 108 147 192 243]
         ]
        ]
    
        perldl> p $x->info
        PDL: Double D [10,2,3]
    
        # insert a dummy dim between IV and the dim (model) to be threaded
    
        perldl> %m = $y->ols_t( $x->dummy(2) )
    
        perldl> p "$_\t$m{$_}\n" for (sort keys %m)
    
        # 2 persons' ratings, eached fitted with 3 "different" models
    
        F
        [
         [ 38.314159  25.087209]
         [ 38.314159  25.087209]
         [ 38.314159  25.087209]
        ]
    
        # df is the same across dv and iv models
     
        F_df    [2 7]
        F_p
        [
         [0.00016967051 0.00064215074]
         [0.00016967051 0.00064215074]
         [0.00016967051 0.00064215074]
        ]
        
        R2
        [
         [ 0.9162963 0.87756762]
         [ 0.9162963 0.87756762]
         [ 0.9162963 0.87756762]
        ]
    
        b
        [  # linear      quadratic     constant
         [
          [  0.99015152 -0.056818182   0.66363636]    # person 1
          [  0.18939394  0.022727273          1.4]    # person 2
         ]
         [
          [  0.49507576 -0.028409091   0.66363636]
          [  0.09469697  0.011363636          1.4]
         ]
         [
          [  0.33005051 -0.018939394   0.66363636]
          [ 0.063131313 0.0075757576          1.4]
         ]
        ]
    
        # our novice modeler realizes at this point that
        # the 3 models only differ in the scaling of the IV coefficients 
        
        ss_model
        [
         [ 20.616667  13.075758]
         [ 20.616667  13.075758]
         [ 20.616667  13.075758]
        ]
        
        ss_residual
        [
         [ 1.8833333  1.8242424]
         [ 1.8833333  1.8242424]
         [ 1.8833333  1.8242424]
        ]
        
        ss_total        [22.5 14.9]
        y_pred
        [
         [
          [0.66363636  1.5969697  2.4166667  3.1227273  ...  4.9727273]
        ...

    r2_change

    Significance test for the incremental change in R2 when new variable(s) are added to an ols regression model. Returns the change stats as well as stats for both models. Based on ols_t. (One way to make up for the lack of significance tests for coeffs in ols_t).

    Default options (case insensitive):

        CONST   => 1,

    Usage:

        # suppose these are two persons' ratings for top 10 box office movies
        # ascending sorted by box office
    
        perldl> p $y = qsort ceil(random(10, 2) * 5)
        [
         [1 1 2 2 2 3 4 4 4 4]
         [1 2 2 3 3 3 4 4 5 5]
        ]
    
        # first IV is a simple linear trend
    
        perldl> p $x1 = sequence 10
        [0 1 2 3 4 5 6 7 8 9]
    
        # the modeler wonders if adding a quadratic trend improves the fit
    
        perldl> p $x2 = sequence(10) ** 2
        [0 1 4 9 16 25 36 49 64 81]
    
        # two difference models are given in two pdls
        # each as would be pass on to ols_t
        # the 1st model includes only linear trend
        # the 2nd model includes linear and quadratic trends
        # when necessary use dummy dim so both models have the same ndims
    
        perldl> %c = $y->r2_change( $x1->dummy(1), cat($x1, $x2) )
    
        perldl> p "$_\t$c{$_}\n" for (sort keys %c)
          #              person 1   person 2
        F_change        [0.72164948 0.071283096]
          # df same for both persons
        F_df    [1 7]
        F_p     [0.42370145 0.79717232]
        R2_change       [0.0085966043 0.00048562549]
        model0  HASH(0x8c10828)
        model1  HASH(0x8c135c8)
       
        # the answer here is no.

    METHODS

    anova

    Analysis of variance. Uses type III sum of squares for unbalanced data.

    Dependent variable should be a 1D pdl. Independent variables can be passed as 1D perl array ref or 1D pdl.

    Supports bad value (by ignoring missing or BAD values in dependent and independent variables list-wise).

    Default options (case insensitive):

        V      => 1,          # carps if bad value in variables
        IVNM   => [],         # auto filled as ['IV_0', 'IV_1', ... ]
        PLOT   => 1,          # plots highest order effect
                              # can set plot_means options here

    Usage:

        # suppose this is ratings for 12 apples
    
        perldl> p $y = qsort ceil( random(12)*5 )
        [1 1 2 2 2 3 3 4 4 4 5 5]
        
        # IV for types of apple
    
        perldl> p $a = sequence(12) % 3 + 1
        [1 2 3 1 2 3 1 2 3 1 2 3]
    
        # IV for whether we baked the apple
        
        perldl> @b = qw( y y y y y y n n n n n n )
    
        perldl> %m = $y->anova( $a, \@b, { IVNM=>['apple', 'bake'] } )
        
        perldl> p "$_\t$m{$_}\n" for (sort keys %m)
        # apple # m
        [
         [2.5   3 3.5]
        ]
        
        # apple # se
        [
         [0.64549722 0.91287093 0.64549722]
        ]
        
        # apple ~ bake # m
        [
         [1.5 1.5 2.5]
         [3.5 4.5 4.5]
        ]
        
        # apple ~ bake # se
        [
         [0.5 0.5 0.5]
         [0.5 0.5 0.5]
        ]
        
        # bake # m
        [
         [ 1.8333333  4.1666667]
        ]
        
        # bake # se
        [
         [0.30731815 0.30731815]
        ]
        
        F       7.6
        F_df    [5 6]
        F_p     0.0141586545851857
        ms_model        3.8
        ms_residual     0.5
        ss_model        19
        ss_residual     3
        ss_total        22
        | apple | F     2
        | apple | F_df  [2 6]
        | apple | F_p   0.216
        | apple | ms    1
        | apple | ss    2
        | apple ~ bake | F      0.666666666666667
        | apple ~ bake | F_df   [2 6]
        | apple ~ bake | F_p    0.54770848985725
        | apple ~ bake | ms     0.333333333333334
        | apple ~ bake | ss     0.666666666666667
        | bake | F      32.6666666666667
        | bake | F_df   [1 6]
        | bake | F_p    0.00124263849516693
        | bake | ms     16.3333333333333
        | bake | ss     16.3333333333333

    anova_rptd

    Repeated measures and mixed model anova. Uses type III sum of squares. The standard error (se) for the means are based on the relevant mean squared error from the anova, ie it is pooled across levels of the effect.

    anova_rptd supports bad value in the dependent and independent variables. It automatically removes bad data listwise, ie remove a subject's data if there is any cell missing for the subject.

    Default options (case insensitive):

        V      => 1,    # carps if bad value in dv
        IVNM   => [],   # auto filled as ['IV_0', 'IV_1', ... ]
        BTWN   => [],   # indices of between-subject IVs (matches IVNM indices)
        PLOT   => 1,    # plots highest order effect
                        # see plot_means() for more options

    Usage:

        Some fictional data: recall_w_beer_and_wings.txt
      
        Subject Beer    Wings   Recall
        Alex    1       1       8
        Alex    1       2       9
        Alex    1       3       12
        Alex    2       1       7
        Alex    2       2       9
        Alex    2       3       12
        Brian   1       1       12
        Brian   1       2       13
        Brian   1       3       14
        Brian   2       1       9
        Brian   2       2       8
        Brian   2       3       14
        ...
      
          # rtable allows text only in 1st row and col
        my ($data, $idv, $subj) = rtable 'recall_w_beer_and_wings.txt';
      
        my ($b, $w, $dv) = $data->dog;
          # subj and IVs can be 1d pdl or @ ref
          # subj must be the first argument
        my %m = $dv->anova_rptd( $subj, $b, $w, {ivnm=>['Beer', 'Wings']} );
      
        print "$_\t$m{$_}\n" for (sort keys %m);
    
        # Beer # m	
        [
         [ 10.916667  8.9166667]
        ]
        
        # Beer # se	
        [
         [ 0.4614791  0.4614791]
        ]
        
        # Beer ~ Wings # m	
        [
         [   10     7]
         [ 10.5  9.25]
         [12.25  10.5]
        ]
        
        # Beer ~ Wings # se	
        [
         [0.89170561 0.89170561]
         [0.89170561 0.89170561]
         [0.89170561 0.89170561]
        ]
        
        # Wings # m	
        [
         [   8.5  9.875 11.375]
        ]
        
        # Wings # se	
        [
         [0.67571978 0.67571978 0.67571978]
        ]
        
        ss_residual	19.0833333333333
        ss_subject	24.8333333333333
        ss_total	133.833333333333
        | Beer | F	9.39130434782609
        | Beer | F_p	0.0547977008378944
        | Beer | df	1
        | Beer | ms	24
        | Beer | ss	24
        | Beer || err df	3
        | Beer || err ms	2.55555555555556
        | Beer || err ss	7.66666666666667
        | Beer ~ Wings | F	0.510917030567687
        | Beer ~ Wings | F_p	0.623881438624431
        | Beer ~ Wings | df	2
        | Beer ~ Wings | ms	1.625
        | Beer ~ Wings | ss	3.25000000000001
        | Beer ~ Wings || err df	6
        | Beer ~ Wings || err ms	3.18055555555555
        | Beer ~ Wings || err ss	19.0833333333333
        | Wings | F	4.52851711026616
        | Wings | F_p	0.0632754786153548
        | Wings | df	2
        | Wings | ms	16.5416666666667
        | Wings | ss	33.0833333333333
        | Wings || err df	6
        | Wings || err ms	3.65277777777778
        | Wings || err ss	21.9166666666667

    For mixed model anova, ie when there are between-subject IVs involved, feed the IVs as above, but specify in BTWN which IVs are between-subject. For example, if we had added age as a between-subject IV in the above example, we would do

        my %m = $dv->anova_rptd( $subj, $age, $b, $w,
                               { ivnm=>['Age', 'Beer', 'Wings'], btwn=>[0] });

    dummy_code

    Dummy coding of nominal variable (perl @ ref or 1d pdl) for use in regression.

    Supports BAD value (missing or 'BAD' values result in the corresponding pdl elements being marked as BAD).

        perldl> @a = qw(a a a b b b c c c)
        perldl> p $a = dummy_code(\@a)
        [
         [1 1 1 0 0 0 0 0 0]
         [0 0 0 1 1 1 0 0 0]
        ]

    effect_code

    Unweighted effect coding of nominal variable (perl @ ref or 1d pdl) for use in regression. returns in @ context coded pdl and % ref to level - pdl->dim(1) index.

    Supports BAD value (missing or 'BAD' values result in the corresponding pdl elements being marked as BAD).

        my @var = qw( a a a b b b c c c );
        my ($var_e, $map) = effect_code( \@var );
    
        print $var_e . $var_e->info . "\n";
        
        [
         [ 1  1  1  0  0  0 -1 -1 -1]
         [ 0  0  0  1  1  1 -1 -1 -1]
        ]    
        PDL: Double D [9,2]
    
        print "$_\t$map->{$_}\n" for (sort keys %$map)
        a       0
        b       1
        c       2

    effect_code_w

    Weighted effect code for nominal variable. returns in @ context coded pdl and % ref to level - pdl->dim(1) index.

    Supports BAD value (missing or 'BAD' values result in the corresponding pdl elements being marked as BAD).

        perldl> @a = qw( a a b b b c c )
        perldl> p $a = effect_code_w(\@a)
        [
         [   1    1    0    0    0   -1   -1]
         [   0    0    1    1    1 -1.5 -1.5]
        ]

    interaction_code

    Returns the coded interaction term for effect-coded variables.

    Supports BAD value (missing or 'BAD' values result in the corresponding pdl elements being marked as BAD).

        perldl> $a = sequence(6) > 2      
        perldl> p $a = $a->effect_code
        [
         [ 1  1  1 -1 -1 -1]
        ]
        
        perldl> $b = pdl( qw( 0 1 2 0 1 2 ) )            
        perldl> p $b = $b->effect_code
        [
         [ 1  0 -1  1  0 -1]
         [ 0  1 -1  0  1 -1]
        ]
        
        perldl> p $ab = interaction_code( $a, $b )
        [
         [ 1  0 -1 -1 -0  1]
         [ 0  1 -1 -0 -1  1]
        ]

    ols

    Ordinary least squares regression, aka linear regression. Unlike ols_t, ols is not threadable, but it can handle bad value (by ignoring observations with bad value in dependent or independent variables list-wise) and returns the full model in list context with various stats.

    IVs ($x) should be pdl dims $y->nelem or $y->nelem x n_iv. Do not supply the constant vector in $x. Intercept is automatically added and returned as LAST of the coeffs if CONST=>1. Returns full model in list context and coeff in scalar context.

    Default options (case insensitive):

        CONST  => 1,
        PLOT   => 1,   # see plot_residuals() for plot options

    Usage:

        # suppose this is a person's ratings for top 10 box office movies
        # ascending sorted by box office
    
        perldl> p $y = qsort ceil( random(10) * 5 )
        [1 1 2 2 2 2 4 4 5 5]
    
        # construct IV with linear and quadratic component
    
        perldl> p $x = cat sequence(10), sequence(10)**2
        [
         [ 0  1  2  3  4  5  6  7  8  9]
         [ 0  1  4  9 16 25 36 49 64 81]
        ]
    
        perldl> %m = $y->ols( $x )
    
        perldl> p "$_\t$m{$_}\n" for (sort keys %m)
    
        F       40.4225352112676
        F_df    [2 7]
        F_p     0.000142834216344756
        R2      0.920314253647587
     
        # coeff  linear     quadratic  constant
     
        b       [0.21212121 0.03030303 0.98181818]
        b_p     [0.32800118 0.20303404 0.039910509]
        b_se    [0.20174693 0.021579989 0.38987581]
        b_t     [ 1.0514223   1.404219  2.5182844]
        ss_model        19.8787878787879
        ss_residual     1.72121212121212
        ss_total        21.6
        y_pred  [0.98181818  1.2242424  1.5272727  ...  4.6181818  5.3454545]

    ols_rptd

    Repeated measures linear regression (Lorch & Myers, 1990; Van den Noortgate & Onghena, 2006). Handles purely within-subject design for now. See t/stats_ols_rptd.t for an example using the Lorch and Myers data.

    Usage:

        # This is the example from Lorch and Myers (1990),
        # a study on how characteristics of sentences affected reading time
        # Three within-subject IVs:
        # SP -- serial position of sentence
        # WORDS -- number of words in sentence
        # NEW -- number of new arguments in sentence
    
        # $subj can be 1D pdl or @ ref and must be the first argument
        # IV can be 1D @ ref or pdl
        # 1D @ ref is effect coded internally into pdl
        # pdl is left as is
    
        my %r = $rt->ols_rptd( $subj, $sp, $words, $new );
    
        print "$_\t$r{$_}\n" for (sort keys %r);
    
        (ss_residual)   58.3754646504336
        (ss_subject)    51.8590337714286
        (ss_total)  405.188241771429
        #      SP        WORDS      NEW
        F   [  7.208473  61.354153  1.0243311]
        F_p [0.025006181 2.619081e-05 0.33792837]
        coeff   [0.33337285 0.45858933 0.15162986]
        df  [1 1 1]
        df_err  [9 9 9]
        ms  [ 18.450705  73.813294 0.57026483]
        ms_err  [ 2.5595857  1.2030692 0.55671923]
        ss  [ 18.450705  73.813294 0.57026483]
        ss_err  [ 23.036272  10.827623  5.0104731]

    logistic

    Logistic regression with maximum likelihood estimation using PDL::Fit::LM (requires PDL::Slatec. Hence loaded with "require" in the sub instead of "use" at the beginning).

    IVs ($x) should be pdl dims $y->nelem or $y->nelem x n_iv. Do not supply the constant vector in $x. It is included in the model and returned as LAST of coeff. Returns full model in list context and coeff in scalar context.

    The significance tests are likelihood ratio tests (-2LL deviance) tests. IV significance is tested by comparing deviances between the reduced model (ie with the IV in question removed) and the full model.

    ***NOTE: the results here are qualitatively similar to but not identical with results from R, because different algorithms are used for the nonlinear parameter fit. Use with discretion***

    Default options (case insensitive):

        INITP => zeroes( $x->dim(1) + 1 ),    # n_iv + 1
        MAXIT => 1000,
        EPS   => 1e-7,

    Usage:

        # suppose this is whether a person had rented 10 movies
    
        perldl> p $y = ushort( random(10)*2 )
        [0 0 0 1 1 0 0 1 1 1]
    
        # IV 1 is box office ranking
    
        perldl> p $x1 = sequence(10)
        [0 1 2 3 4 5 6 7 8 9]
    
        # IV 2 is whether the movie is action- or chick-flick
    
        perldl> p $x2 = sequence(10) % 2
        [0 1 0 1 0 1 0 1 0 1]
    
        # concatenate the IVs together
    
        perldl> p $x = cat $x1, $x2
        [
         [0 1 2 3 4 5 6 7 8 9]
         [0 1 0 1 0 1 0 1 0 1]
        ]
    
        perldl> %m = $y->logistic( $x )
    
        perldl> p "$_\t$m{$_}\n" for (sort keys %m)
    
        D0	13.8629436111989
        Dm	9.8627829791575
        Dm_chisq	4.00016063204141
        Dm_df	2
        Dm_p	0.135324414081692
          #  ranking    genre      constant
        b	[0.41127706 0.53876358 -2.1201285]
        b_chisq	[ 3.5974504 0.16835559  2.8577151]
        b_p	[0.057868258  0.6815774 0.090936587]
        iter	12
        y_pred	[0.10715577 0.23683909 ... 0.76316091 0.89284423]

    pca

    Principal component analysis. Based on corr instead of cov (bad values are ignored pair-wise. OK when bad values are few but otherwise probably should fill_m etc before pca). Use PDL::Slatec::eigsys() if installed, otherwise use PDL::MatrixOps::eigens_sym().

    Default options (case insensitive):

        CORR  => 1,     # boolean. use correlation or covariance
        PLOT  => 1,     # calls plot_screes by default
                        # can set plot_screes options here

    Usage:

        my $d = qsort random 10, 5;      # 10 obs on 5 variables
        my %r = $d->pca( \%opt );
        print "$_\t$r{$_}\n" for (keys %r);
    
        eigenvalue    # variance accounted for by each component
        [4.70192 0.199604 0.0471421 0.0372981 0.0140346]
    
        eigenvector   # dim var x comp. weights for mapping variables to component
        [
         [ -0.451251  -0.440696  -0.457628  -0.451491  -0.434618]
         [ -0.274551   0.582455   0.131494   0.255261  -0.709168]
         [   0.43282   0.500662  -0.139209  -0.735144 -0.0467834]
         [  0.693634  -0.428171   0.125114   0.128145  -0.550879]
         [  0.229202   0.180393  -0.859217     0.4173  0.0503155]
        ]
        
        loadings      # dim var x comp. correlation between variable and component
        [
         [ -0.978489  -0.955601  -0.992316   -0.97901  -0.942421]
         [ -0.122661   0.260224  0.0587476   0.114043  -0.316836]
         [ 0.0939749   0.108705 -0.0302253  -0.159616 -0.0101577]
         [   0.13396 -0.0826915  0.0241629  0.0247483   -0.10639]
         [  0.027153  0.0213708  -0.101789  0.0494365 0.00596076]
        ]
        
        pct_var       # percent variance accounted for by each component
        [0.940384 0.0399209 0.00942842 0.00745963 0.00280691]

    Plot scores along the first two components,

        $d->plot_scores( $r{eigenvector} );

    pca_sorti

    Determine by which vars a component is best represented. Descending sort vars by size of association with that component. Returns sorted var and relevant component indices.

    Default options (case insensitive):

        NCOMP => 10,     # maximum number of components to consider

    Usage:

          # let's see if we replicated the Osgood et al. (1957) study
        perldl> ($data, $idv, $ido) = rtable 'osgood_exp.csv', {v=>0}
    
          # select a subset of var to do pca
        perldl> $ind = which_id $idv, [qw( ACTIVE BASS BRIGHT CALM FAST GOOD HAPPY HARD LARGE HEAVY )]
        perldl> $data = $data( ,$ind)->sever
        perldl> @$idv = @$idv[list $ind]
    
        perldl> %m = $data->pca
     
        perldl> ($iv, $ic) = $m{loadings}->pca_sorti()
    
        perldl> p "$idv->[$_]\t" . $m{loadings}->($_,$ic)->flat . "\n" for (list $iv)
    
                 #   COMP0     COMP1    COMP2    COMP3
        HAPPY	[0.860191 0.364911 0.174372 -0.10484]
        GOOD	[0.848694 0.303652 0.198378 -0.115177]
        CALM	[0.821177 -0.130542 0.396215 -0.125368]
        BRIGHT	[0.78303 0.232808 -0.0534081 -0.0528796]
        HEAVY	[-0.623036 0.454826 0.50447 0.073007]
        HARD	[-0.679179 0.0505568 0.384467 0.165608]
        ACTIVE	[-0.161098 0.760778 -0.44893 -0.0888592]
        FAST	[-0.196042 0.71479 -0.471355 0.00460276]
        LARGE	[-0.241994 0.594644 0.634703 -0.00618055]
        BASS	[-0.621213 -0.124918 0.0605367 -0.765184]

    plot_means

    Plots means anova style. Can handle up to 4-way interactions (ie 4D pdl).

    Default options (case insensitive):

        IVNM  => ['IV_0', 'IV_1', 'IV_2', 'IV_3'],
        DVNM  => 'DV',
        AUTO  => 1,       # auto set dims to be on x-axis, line, panel
                          # if set 0, dim 0 goes on x-axis, dim 1 as lines
                          # dim 2+ as panels
          # see PDL::Graphics::PGPLOT::Window for next options
        WIN   => undef,   # pgwin object. not closed here if passed
                          # allows comparing multiple lines in same plot
                          # set env before passing WIN
        DEV   => '/xs',         # open and close dev for plotting if no WIN
                                # defaults to '/png' in Windows
        SIZE  => 640,           # individual square panel size in pixels
        SYMBL => [0, 4, 7, 11],

    Usage:

          # see anova for mean / se pdl structure
        $mean->plot_means( $se, {IVNM=>['apple', 'bake']} );

    Or like this:

        $m{'# apple ~ bake # m'}->plot_means;

    plot_residuals

    Plots residuals against predicted values.

    Usage:

        $y->plot_residuals( $y_pred, { dev=>'/png' } );

    Default options (case insensitive):

         # see PDL::Graphics::PGPLOT::Window for more info
        WIN   => undef,  # pgwin object. not closed here if passed
                         # allows comparing multiple lines in same plot
                         # set env before passing WIN
        DEV   => '/xs',  # open and close dev for plotting if no WIN
                         # defaults to '/png' in Windows
        SIZE  => 640,    # plot size in pixels
        COLOR => 1,

    plot_scores

    Plots standardized original and PCA transformed scores against two components. (Thank you, Bob MacCallum, for the documentation suggestion that led to this function.)

    Default options (case insensitive):

      CORR  => 1,      # boolean. PCA was based on correlation or covariance
      COMP  => [0,1],  # indices to components to plot
        # see PDL::Graphics::PGPLOT::Window for next options
      WIN   => undef,  # pgwin object. not closed here if passed
                       # allows comparing multiple lines in same plot
                       # set env before passing WIN
      DEV   => '/xs',  # open and close dev for plotting if no WIN
                       # defaults to '/png' in Windows
      SIZE  => 640,    # plot size in pixels
      COLOR => [1,2],  # color for original and rotated scores

    Usage:

      my %p = $data->pca();
      $data->plot_scores( $p{eigenvector}, \%opt );

    plot_screes

    Scree plot. Plots proportion of variance accounted for by PCA components.

    Default options (case insensitive):

      NCOMP => 20,     # max number of components to plot
      CUT   => 0,      # set to plot cutoff line after this many components
                       # undef to plot suggested cutoff line for NCOMP comps
       # see PDL::Graphics::PGPLOT::Window for next options
      WIN   => undef,  # pgwin object. not closed here if passed
                       # allows comparing multiple lines in same plot
                       # set env before passing WIN
      DEV   => '/xs',  # open and close dev for plotting if no WIN
                       # defaults to '/png' in Windows
      SIZE  => 640,    # plot size in pixels
      COLOR => 1,

    Usage:

      # variance should be in descending order
     
      $pca{var}->plot_screes( {ncomp=>16} );

    Or, because NCOMP is used so often, it is allowed a shortcut,

      $pca{var}->plot_screes( 16 );

    SEE ALSO

    PDL::Fit::Linfit

    PDL::Fit::LM

    REFERENCES

    Cohen, J., Cohen, P., West, S.G., & Aiken, L.S. (2003). Applied Multiple Regression/correlation Analysis for the Behavioral Sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates Publishers.

    Hosmer, D.W., & Lemeshow, S. (2000). Applied Logistic Regression (2nd ed.). New York, NY: Wiley-Interscience.

    Lorch, R.F., & Myers, J.L. (1990). Regression analyses of repeated measures data in cognitive research. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 149-157.

    Osgood C.E., Suci, G.J., & Tannenbaum, P.H. (1957). The Measurement of Meaning. Champaign, IL: University of Illinois Press.

    Rutherford, A. (2001). Introducing Anova and Ancova: A GLM Approach (1st ed.). Thousand Oaks, CA: Sage Publications.

    Shlens, J. (2009). A Tutorial on Principal Component Analysis. Retrieved April 10, 2011 from http://citeseerx.ist.psu.edu/

    The GLM procedure: unbalanced ANOVA for two-way design with interaction. (2008). SAS/STAT(R) 9.2 User's Guide. Retrieved June 18, 2009 from http://support.sas.com/

    Van den Noortgatea, W., & Onghenaa, P. (2006). Analysing repeated measures data in cognitive research: A comment on regression coefficient analyses. European Journal of Cognitive Psychology, 18, 937-952.