Talk:Principal components analysis
The article seems terribly cluttered. In particular, I dislike the table of symbols. Sboehringer
[edit] Question on reduced-space data matrix
The article states: "and then obtaining the reduced-space data matrix Y by projecting X down into the reduced space defined by only the first L singular vectors, WL: Y = WLT X = ΣL VLT".
I believe that the correct formula is: Y = WL ΣL.
Can anyone verify this? —The preceding unsigned comment was added by 216.113.168.141 (talk • contribs).
- Afraid not. The way things are set up in the article, the data matrix X, of size M x N, consists of N column vectors, each representing a different sampling event; with each sampling made up of measurements of M different variables, so giving the matrix M different rows.
- With the reduced space, we want to find a smaller set of L new variables, which for each sampling preserves as much of the information as possible out of the original M variables.
- So we're looking for an L x N matrix, with the same number of columns (the same number of samples), but a smaller number of rows (so each sample is described by fewer variables).
- Matrix WL is an M x L matrix, so WL ΣL is also an M x L matrix - not the shape we're looking for. But ΣL VLT is the desired L x N shape.
- Hope this helps. -- Jheald 11:02, 17 June 2006 (UTC).
- Yes, that clarifies it. Thanks Jheald! I thought that the row vectors of X were the sampling events and the column vectors were the variables -- since the definition of X is in fact the transpose of what I thought, everything now makes sense. -- 12:33, 26 June 2006
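To make the shape argument above concrete, here is a minimal numpy sketch (the sizes and variable names are made up for illustration, not taken from the article):
 import numpy as np
 M, N, L = 5, 100, 2                      # M variables, N samples, L retained components
 X = np.random.randn(M, N)                # data matrix in the article's M x N convention
 # economy SVD: X = W @ diag(s) @ Vt, with W of shape M x M and Vt of shape M x N
 W, s, Vt = np.linalg.svd(X, full_matrices=False)
 W_L = W[:, :L]                           # first L left singular vectors, M x L
 Y = W_L.T @ X                            # reduced-space data matrix, L x N
 Y_alt = np.diag(s[:L]) @ Vt[:L, :]       # the equivalent Sigma_L V_L^T form, also L x N
 print(Y.shape, np.allclose(Y, Y_alt))    # (2, 100) True
 print((W_L @ np.diag(s[:L])).shape)      # (5, 2): W_L Sigma_L has the wrong shape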
[edit] Separate articles on arg max and arg min notations
We probably need a small article on the arg max and arg min notations.
[edit] Missing crucial details
The article seems to be missing crucial details. I can't see where the actual dimension reduction is happening. Is the idea that you have several samples of the measurement vector x and you use these to estimate the expectations? 130.188.8.9 16:49, 20 Aug 2003 (UTC)
- There should now be a clue. However, the article still needs work
[edit] Plural versus singular title
Principal components analysis is better known as Principal component analysis (singular). This should be the main title, with the plural form a synonym referring to this page (unfortunately I do not know how to do that).
- I've always heard it with the plural. I have a PhD in statistics. I'm not saying the singular could never be used, but the plural is certainly the one that's frequently heard. Michael Hardy 21:18, 22 Mar 2004 (UTC)
- The only monograph solely dedicated to PCA that I know of is by Jolliffe and is titled "Principal component analysis". The naming issue is discussed in its introduction, and differently from what you indicate. Then again, naming issues are conventions and vary across the globe. Sboehringer
- Google says: "Principal component analysis": 103,000 hits, "Principal components analysis": 46,300 hits. MH 13:48, 25 Mar 2004 (UTC)
- I have that monograph and you are correct. It seems, however, that the analysis elucidates the principal components, plural, and so unless one is only interested in one principal component at a time, the plural appears to be more appropriate.
[edit] Article needs serious improvement
Moving Michael Hardy's comments to Talk:
This article needs some serious revamping, to say the least. One cannot assume without loss of generality that the expectation is zero. If the expectation were observable, one could subtract it from x and get something with zero expectation, and so no generality would be lost by this assumption. In practice the expectation is never observable, and one must consider the probability distribution of the difference between x and an estimate, based on data, of the expectation of x.
Excuse me, but that is absurd. If the mean were observable, then one could simply subtract the mean from X, getting something with zero mean, and then indeed no generality would be lost by assuming that. In practice, one must use a data-based and therefore uncertain estimate of the mean, and one must therefore consider the probability distribution of the difference between X and the estimate of the mean of X.
- If I may respond --- PCA is a technique that is applied to empirical data sets. PCA eigendecomposes the maximum likelihood covariance matrix. Indeed, there is a distribution of PCA decompositions about the "true" decomposition that you would get in the infinite data limit. But, that does not make it absurd. Or rather, no more absurd than any other maximum likelihood estimate. Any ML technique will have a variance around the estimate from infinite data.
- Are you objecting because ML is not mentioned in the article? Or is it something else? -- hike395 04:39, 5 May 2004 (UTC)
- Something else. Several something elses. It doesn't seem like that good an article. I'll probably drastically edit it within a few months; it's on my list. Michael Hardy 16:31, 5 May 2004 (UTC)
[edit] PCR and PLS?
Would it be redundant to include some discussion of principal components regression? I don't think so, but I don't feel qualified to explain it.
It would also be nice to have a piece on Partial Least Squares. Geladi and Kowalski Analytica Chimica Acta 185 (1986) 1-17 may serve as a starting point.
- I disagree --- PLS and PCR are both forms of linear regression, which is supervised learning. PCA is density estimation, which is unsupervised learning. Very different sorts of algorithms --- hike395 04:35, 22 Mar 2005 (UTC)
[edit] PCA & Least Squares
Is PCA the same as a least squares fit? (Furthermore, is either the same as finding the principal moment of inertia of an n-dimensional body?) —BenFrantzDale 23:53, August 3, 2005 (UTC)
- No. A least-squares fit minimizes (the squares of) the residuals, the vertical distances from the fit line (hyperplane) to the data. PCA minimizes the orthogonal distances to the hyperplane. (Or something like that; I don't really know what I'm talking about.) As for moments of inertia, well, physics isn't exactly my area of expertise. —Caesura(t) 18:44, 14 December 2005 (UTC)
- Yes. PCA is equivalent to finding the principal axes of inertia for N point masses in m dimensions, and then throwing all but l of the new transformed co-ordinates away. It's also mathematically the same problem as Total Least Squares (errors in all variables), rather than Ordinary Least Squares (errors only in y, not x), if you can scale it so the errors in all the variables are uncorrelated and the same size. You're then finding the best l-dimensional hyperplane through the m-dimensional space that your data ought to sit on. The real power tool behind all of this, and the one to get a feel for, is the singular value decomposition (SVD). PCA is just SVD applied to your data. -- Jheald 19:40, 12 January 2006 (UTC).
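For anyone wanting to see the distinction in code, here is a small numpy sketch (synthetic data, illustrative names only): ordinary least squares minimises vertical residuals, while the first principal direction from the SVD minimises orthogonal distances, so the two slopes generally differ slightly.
 import numpy as np
 rng = np.random.default_rng(0)
 x = rng.normal(size=200)
 y = 0.5 * x + 0.1 * rng.normal(size=200)       # noisy linear relationship
 pts = np.column_stack([x, y])
 Xc = pts - pts.mean(axis=0)                    # centre both variables
 # OLS slope: minimises squared vertical residuals (errors assumed only in y)
 slope_ols = np.sum(Xc[:, 0] * Xc[:, 1]) / np.sum(Xc[:, 0] ** 2)
 # first principal direction: minimises squared orthogonal distances (total least squares)
 _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
 slope_pca = Vt[0, 1] / Vt[0, 0]
 print(slope_ols, slope_pca)                    # close, but not identical in general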
[edit] Derivation of PCA
Shouldn't the constraint that we are looking for the maximum variance appear somewhere in that derivation ? I cannot understand it clearly as it is right now. --Raistlin 12:49, 24 August 2005 (UTC)
[edit] Conjugate transpose
- and the superscript *T represents the conjugate transpose operation.
Why conjugate transpose instead of a normal transpose? Does it even work with complex numbers? Taw 04:18, 31 December 2005 (UTC)
- As you probably know, the conjugate transpose is a generalization of the plain old transpose that allows these operations to work on complex numbers instead of just real numbers. If the source data X consists entirely of real numbers, then the conjugate operation is completely transparent, since the conjugate of a real number is the number itself. But if the source data includes complex numbers, then the conjugate operation is absolutely essential for the matrix operations to yield meaningful results. As far as I can tell, it does work on complex numbers. As an example where you might have complex numbers as source data, you might want to use PCA on the Fourier components of a real, discrete-time signal, which are in general complex. -- Metacomet 18:59, 1 January 2006 (UTC)
- I have added a motivation paragraph at Conjugate_transpose#Motivation to try to show why it is so natural for the conjugate transpose to turn up, whenever the matrix you're transposing includes complex numbers. Hope it's helpful. -- Jheald 20:14, 12 January 2006 (UTC).
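A hedged numpy illustration of why the conjugate matters for complex data such as Fourier coefficients (the array sizes here are arbitrary): using the conjugate transpose keeps the covariance Hermitian, so its eigenvalues are real and non-negative; a plain transpose does not.
 import numpy as np
 rng = np.random.default_rng(1)
 signal = rng.normal(size=(8, 64))              # 8 real time series, 64 samples each
 Z = np.fft.rfft(signal, axis=1)                # complex Fourier coefficients, 8 x 33
 Zc = Z - Z.mean(axis=1, keepdims=True)
 C = Zc @ Zc.conj().T / (Zc.shape[1] - 1)       # conjugate transpose: Hermitian covariance
 C_plain = Zc @ Zc.T / (Zc.shape[1] - 1)        # plain transpose: not Hermitian in general
 print(np.allclose(C, C.conj().T))              # True
 print(np.allclose(C_plain, C_plain.conj().T))  # False for generic complex data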
[edit] Computation -- surely this is not the right way to go ?
The section on computation looks to make a real meal of things, IMO; and to be pretty dubious too, as regards its numerical analysis. As soon as you square the data matrix to form the covariance, you square its condition number, roughly halving the number of accurate digits; in effect the accuracy of your SVD drops from double precision to single precision.
Is there any reason to prefer either of the methods in the text, compared to choosing which bits of the SVD you actually want to keep, and then just wheeling out R-SVD ? (Which I imagine is quicker, too). -- Jheald 19:05, 12 January 2006 (UTC).
- I agree that this article is unreadable. The lengthy "PCA algorithm" section is one of the main reasons - it is too long, and it doesn't agree with the equations in the introduction (where did we divide by N-1? why? what about the empirical standard deviations?). It doesn't even say what the output of the algorithm is, AFAICT. A5 13:32, 6 March 2006 (UTC)
- I am working on improving the algorithm section to make it more readable. In the end, the section will still be quite long, because the algorithm is rather complicated and I think it is important to include enough detail so that people can actually implement it in software. After I have completed this upgrade, please make specific suggestions for further improvements. -- Metacomet 21:37, 9 March 2006 (UTC)
- I am done for now. There is still more work to do, but it's a good start. Please provide comments and suggestions for improvement. Thanks. -- Metacomet 23:12, 9 March 2006 (UTC)
- The improvement I would suggest is to delete the whole entire section completely, starting from the table, and then everything following it; and instead tell people to use SVD.
- A standard SVD routine will be better written, better tested, faster, and more numerically stable.
- IMO it is totally irresponsible for the article to be suggesting inefficient homespun routines, actually leading people away from the standard SVD routines. -- Jheald 00:03, 10 March 2006 (UTC).
- I'm no expert Jheald, but I don't see what you're so worried about. Algorithms for SVD that I have seen on the WWW basically consist of the same algorithm that is listed on this PCA page, only done twice, once for left handed eigenvalues, once for right. Is there some other algorithm for SVD that is much preferable? --Chinasaur 08:40, 25 May 2006 (UTC)
- I am really glad that you took some time to carefully review the work that I did and make some thoughtful recommendations. Thanks for the constructive feedback. Oh yes, that is sarcasm, in case you were wondering. -- Metacomet 00:48, 10 March 2006 (UTC)
- "...totally irresponsible..." Don't you think that is just a wee bit of hyperbole?
- "...homespun routines..." Are you referring to calculating the mean, the standard deviation, or the covariance? No, that can't be right, those are well-known and well-established procedures from statistics. Or perhaps eigenvectors and eigenvalues? Hmmmm, those are standard routines in linear algebra. Sorting the basis vectors by energy content and keeping only the ones with the highest contribution? No, that's also a standard concept called the 80-20 rule (or Pareto's principle). I guess I just don't understand what you mean by homespun routines....
- -- Metacomet 01:36, 10 March 2006 (UTC)
- BTW, I am pretty sure that dividing by N-1 is correct, which means the introduction needs to be fixed, not the algorithm. The reason the algorithm needs to divide by N-1 is that it is computing the expected value of the product, not the product itself. -- Metacomet 21:50, 9 March 2006 (UTC)
- I don't know anything about maths, but all the pages about the covariance matrix use N, so maybe N-1 is not so correct...? -- IC 18:48, 18 November 2006 (GMT+1)
A mathematical derivation with eigenvalues and eigenvectors is fine, but such methods should not be called algorithms. The practical computation should be SVD. Squaring the matrix to get the covariance is harmful. By the way, homegrown SVD is harmful as well, and I must support Jheald on both counts. A professional implementation should use SVD from an efficient and stable library, just as one should never write matrix-matrix multiplication except as college homework. LAPACK is the de facto standard for all of that. (There might be justified exceptions, such as programming something exotic with little memory.) Even if some engineering textbooks have algorithms like the one here, with the covariance created explicitly, their authors are obviously not professional numerical analysts or developers, or they do not care about the numerical aspects. Jmath666 07:28, 16 March 2007 (UTC)
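To make the recommendation above concrete, here is a minimal numpy sketch (sizes and names are illustrative) of the SVD route set against the explicit-covariance route. Forming the covariance squares the condition number, which is the numerical objection; note also that the 1/(N-1) factor only rescales the eigenvalues and leaves the principal directions unchanged, which bears on the N versus N-1 question above.
 import numpy as np
 rng = np.random.default_rng(2)
 X = rng.normal(size=(6, 500))                  # 6 variables (rows), 500 observations (columns)
 Xc = X - X.mean(axis=1, keepdims=True)         # subtract each variable's empirical mean
 # recommended: SVD of the centred data, no covariance matrix ever formed
 U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
 var_svd = s**2 / (Xc.shape[1] - 1)             # variances along the principal directions (columns of U)
 # the route criticised above: form the covariance explicitly, then eigendecompose
 C = Xc @ Xc.T / (Xc.shape[1] - 1)
 evals = np.linalg.eigvalsh(C)[::-1]            # descending order
 print(np.allclose(var_svd, evals))             # True up to rounding, but SVD avoids squaring kappa(X)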
[edit] Simplification
Could someone put one sentence at the top explaining this in layman's terms? It looks to me like a very fancy and statistically smart way to average a whole heap of data into some sort of dataset common to all of the data -- is this at all a correct impression? --Fastfission 04:07, 28 January 2006 (UTC)
[edit] Cov Matrix
If one is dealing with an MxN data set, i.e. N factors and M observations of each, the resulting cov matrix will be NxN, not MxM.
- It seems like everything from the mean vector subtraction to the covariance matrix calculation is done as if the data are organized as M rows of variables and N columns of observations. This is not properly explained in the "organizing the data" section, and is kind of the opposite of what most people would expect. I'm inclined to reverse everything. --Chinasaur 22:39, 19 May 2006 (UTC)
[edit] Cov Matrix size
The size of the cov matrix C is still unclear. From the section "Find the eigenvectors and eigenvalues of the covariance matrix" on, it is considered to be NxN, while in the section "Find the covariance matrix" it is MxM, which I think is the right size, since the matrix B is MxN. 133.6.156.71 12:07, 6 June 2006 (UTC)
Shouldn't it read "inner product" instead of "outer product"? Computing C as an outer product would make it a tensor.
[edit] This isn't really working!
The first point I wondered about is "Calculate the empirical mean". I think the mean is not calculated in the right way: it is taken over each of the M dimensions. Isn't that distorting the data? I think you have to take the mean over each observation (N-vector).
The second point is the size, first of the covariance matrix and then of the eigenvalue matrix. By calculating the eigenvalues you get one for each variable in the data set, so the size of this matrix should be MxM; and to reach this result, the covariance matrix must have the same size.
... Does anybody have an idea how this really works?
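The sizes work out consistently if you stick to the article's MxN convention (M variables as rows, N observations as columns): the mean is taken across the N observations for each variable, and both the covariance matrix and the set of eigenvalues are of size M. A small numpy check, with illustrative sizes only:
 import numpy as np
 M, N = 4, 50                                   # 4 variables, 50 observations
 X = np.random.randn(M, N)
 u = X.mean(axis=1, keepdims=True)              # empirical mean of each variable: M x 1
 B = X - u                                      # mean-subtracted data: M x N
 C = B @ B.T / (N - 1)                          # covariance matrix: M x M
 evals, V = np.linalg.eigh(C)                   # M eigenvalues, V is M x M
 print(u.shape, C.shape, evals.shape)           # (4, 1) (4, 4) (4,)
Whichever convention is used, the covariance matrix is always (number of variables) x (number of variables).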
[edit] What's the difference between PCA and ICA
Just wondering... this isn't clear to me from these articles. --137.215.6.53 12:18, 3 August 2006 (UTC)
[edit] Principal components analysis versus exploratory factor analysis
I suggest including a subsection discussing the differences between PCA and exploratory factor analysis. In my experience working in a stat lab, students/clients often get the two confused. Perhaps a description of the differences between PCA and EFA could be included; this could also be added to common factor analysis. Below is my understanding of the differences. I did not want to use Greek symbols, so that it may be more accessible to non-mathematicians. What do you think?
Exploratory factor analysis (EFA) and principal component analysis (PCA) may differ in their utility. The goal in using EFA is factor structure interpretation and also in data reduction (reducing a large set of variables to a smaller set of new variables); whereas, the goal for PCA is usually only data reduction.
EFA is used to determine the number and the nature of latent factors which may account for a large part of the correlations among a large number of measured variables. On the other hand, PCA is used to reduce scores on a large set of observed (or measured) variables to a smaller set of linear composites of the original variables that retain as much information as possible from the original variables. That is, the components (linear combinations of the observed items) serve as a reduced set of the observed variables.
Moreover, the core theoretical assumptions are different for the two methods: EFA is based on the common factor model (FA), whereas PCA is not.
1. Common and unique variances
- Common factor model (FA): Factors are latent variables that explain the covariances (or correlations) among the observed variables (items). That is, each observed item is a linear function of the common factors (one or more latent factors) plus one unique factor (a latent construct affiliated with that observed variable). The latent factors are viewed as the causes of the observed variables.
- Note: Total variance of variable = common variance + unique variance (in which, unique variance = specific + error variance).
- Principal components (PCA): In contrast, PCA does not distinguish between common and unique variance. The components are estimated so as to represent the variances of the observed variables in as economical a fashion as possible (i.e., in as small a number of dimensions as possible), and no latent (or common) variables underlying the observed variables need to be invoked. Instead, the principal components are optimally weighted sums of the observed variables (i.e., components are linear combinations of the observed items). So, in a sense, the observed variables are the causes of the composite variables.
2. Reproduction of observed variables
- FA: Underlying factor structure tries to reproduce the correlations among the items
- PCA: Composites reproduce the variances of observed variables
3. Assumption concerning communalities & the matrix type.
- FA: Assumes that a variable's variance is composed of common variance and unique variance. For this reason, we analyze the matrix of correlations among measured variables with communality estimates (i.e., proportion of variance accounted for in each variable by the rest of the variables) on the main diagonal. This matrix is called the Rreduced.
- Note: Principal Axis factoring (PAF) = principal component analysis on Rreduced.
- PCA: There is no place for unique variance; all variance is common. Hence, we analyze the matrix of correlations (Rxx) among measured variables with 1.0s (representing all of the variance of the observed variables) on the main diagonal. The variance of each measured variable is entirely accounted for by the linear combination of principal components.
Also see factor analysis.
(Please bear with me, I am new to using Wikipedia.)
RicoStatGuy 15:53, Sept 30, 2006(UTC)
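One schematic way to write the contrast above (the symbols used here, loadings λ, common factors f, unique factors u, and weights w, are chosen for illustration rather than taken from the article):
 % common factor model: each observed variable is caused by latent factors plus a unique part
 x_j = \lambda_{j1} f_1 + \lambda_{j2} f_2 + \cdots + \lambda_{jm} f_m + u_j ,
 \qquad \operatorname{Var}(x_j) = \underbrace{h_j^2}_{\text{common}} + \underbrace{\psi_j}_{\text{unique}}
 % PCA: each component is a weighted sum of the observed variables; no unique term is invoked
 y_k = w_{k1} x_1 + w_{k2} x_2 + \cdots + w_{kp} x_p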
[edit] Orthogonality of components
According to this PDF, the eigenvectors of a covariance matrix are orthogonal. The eigenvectors of an arbitrary matrix are not necessarily orthogonal, as seen in the leading picture on the eigenvector page. So what gives? Why are these eigenvectors necessarily orthogonal? —Ben FrantzDale 14:44, 7 September 2006 (UTC)
- According to Symmetric matrix, "Another way of stating the spectral theorem is that the eigenvectors of a symmetric matrix are orthogonal." That explains that. 128.113.54.151 20:00, 7 September 2006 (UTC)
- If the multiplicity of every eigenvalue of the covariance matrix is 1, then the eigenvectors will by necessity be orthogonal.
- If there exists an eigenvalue of the covariance matrix with multiplicity greater than 1, say multiplicity r, then it corresponds to an r-dimensional subspace of Rn (n being the dimension of the covariance matrix). The corresponding eigenvectors can in principle be any basis of this subspace, but generally speaking the basis is chosen to be orthogonal.
- So to answer the question, in some cases they must be orthogonal, and in some cases they do not all have to be, but are usually chosen to be so.
- On a side note, all software packages I am aware of will return orthogonal eigenvectors in the multiplicity case. I suspect that this is because the algorithms implicitly force this by recursively projecting Rn into the nullspace of the most recent eigenvector, or something equivalent. Baccyak4H (talk) 17:56, 20 November 2006 (UTC)
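A small numpy check of the point made above (the matrices are arbitrary illustrations): for a symmetric covariance matrix the returned eigenvectors are orthonormal, whereas a generic non-symmetric matrix need not have orthogonal eigenvectors.
 import numpy as np
 rng = np.random.default_rng(3)
 X = rng.normal(size=(100, 5))
 C = np.cov(X, rowvar=False)                    # 5 x 5 symmetric covariance matrix
 evals, V = np.linalg.eigh(C)                   # eigh is for symmetric/Hermitian matrices
 print(np.allclose(V.T @ V, np.eye(5)))         # True: eigenvectors come back orthonormal
 A = rng.normal(size=(5, 5))                    # generic non-symmetric matrix
 w, Q = np.linalg.eig(A)
 print(np.allclose(Q.conj().T @ Q, np.eye(5)))  # generally False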
[edit] Abdi
The following comes from my talk page:
" Hi Brusegadi, I deleted some references in principal component analysis related to an author called Abdi because they are neither standard references nor really related to the basic and strong theory expected of the article. Unless you can prove otherwise, I will take steps such that these references would not appear ever again. Can't you understand that this author is self-promoting or may be he is someone dear to you or you are Abdi. Mind your words! PCA_Hero "
PCA_Hero, the reason I deleted your edit is because it looked like a case of blanking. You should have been able to tell that from the message I left on your anon talk page (note that this is the standard message wiki prescribes for blanking vandals). I see a lot of blanking from anons. Had you provided an explanation on the ARTICLE'S talk page, like the one you gave on my talk page AFTER everything had happened, I would NOT have deleted it. Something we can both learn from is WP:AGF, since I accused you of blanking and you accused me of promoting spam. The difference is that when you look at your contributions you see very few (as of today), but when you look at mine you see many, from a broad range of topics. I will also state that I do not appreciate your tone! Mind OUR assumptions! Brusegadi 19:51, 15 November 2006 (UTC)
[edit] Rows and columns
I think our convention for the data matrix is probably the wrong way round. At the end of the day, it would probably be more natural if our "principal components" vector was a column vector.
I also think that confusion between the two conventions is one of the things that has been making the article more difficult than it needs to be.
I propose to go ahead and make this change, unless anyone thinks it's a bad idea ? Jheald 16:53, 31 January 2007 (UTC).
[edit] Percent variance??
Presumably one wants to compare the sum of the leading eigenvalues to the sum of all eigenvalues; the example of comparing to a threshold of 90% doesn't make much sense otherwise.
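That is the usual reading: keep the smallest L whose leading eigenvalues account for at least the chosen fraction of the total. A minimal sketch in numpy (the function name and threshold are illustrative):
 import numpy as np
 def components_for_threshold(eigenvalues, threshold=0.90):
     """Smallest L such that the L largest eigenvalues exceed `threshold` of their total."""
     vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # descending
     cumulative = np.cumsum(vals) / vals.sum()
     return int(np.searchsorted(cumulative, threshold) + 1)
 print(components_for_threshold([4.0, 2.0, 1.5, 0.4, 0.1]))       # -> 3 (93.75% of the total)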
[edit] Cumulative energy
This term for the contributions of components seems to be from some other field. Is there a better general term for this ? Shyamal 08:25, 12 March 2007 (UTC)
[edit] Terminology used
Many users of PCA expect certain terminology such as the decomposition into "loadings" and "scores". The term loading itself is never used in the article and this can be confusing. The following is a mechanical statement for PCA in Matlab.
For a data set X, eigenvalue decomposition of a square matrix (here the covariance of X) produces 1) an eigenvector matrix V whose columns are eigenvectors, and 2) a diagonal eigenvalue matrix D, such that Cov*V = V*D and Cov = V*D*inv(V). Depending on the algorithm, the elements of D may come out in ascending, descending or unsorted order, but the elements of D and the columns of V can be sorted together without changing these identities; Matlab's eig() function, for instance, returns the D values in ascending order, though descending is often preferred.
If X consists of samples in rows and variables in columns, then X'X gives the covariance matrix (up to a scale factor) provided X is mean-centered. PCA can be done on the covariance matrix, or on X'X even without mean centering:
 Cov = X'*X                   % square matrix whose dimension is the number of variables (columns)
 [V,D] = eig(Cov)             % in Matlab, the loadings are the columns of V
 Scores = X * V(:, m-k+1:m)   % k retained components, m = number of variables, D ascending
It can be verified that X ≈ Scores * Loadings'.
Instead of using the covariance matrix X'X, one can also compute PCA from the X matrix directly using the singular value decomposition (SVD). In Matlab, [U,S,V] = svd(X); the V here is identical (up to sign) to the V (loadings) obtained by eigenvalue decomposition, and the scores are now equal to U*S.
Hope someone can use the above, suitably formatted, in the article, with an explanation of the terms scores and loadings. Shyamal 09:15, 12 March 2007 (UTC)
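As a cross-check of the terminology, here is a hedged numpy transcription of the recipe (samples in rows, variables in columns; data and names are illustrative): the loadings from the covariance eigendecomposition match the SVD right singular vectors up to sign, and the scores come out as U*S.
 import numpy as np
 rng = np.random.default_rng(4)
 X = rng.normal(size=(200, 6))
 Xc = X - X.mean(axis=0)                        # mean-centre each variable (column)
 # eigendecomposition of the covariance: columns of V are the loadings
 C = Xc.T @ Xc / (Xc.shape[0] - 1)
 evals, V = np.linalg.eigh(C)
 V = V[:, ::-1]                                 # re-sort into descending eigenvalue order
 # SVD route: loadings are the right singular vectors, scores are U*S
 U, s, Wt = np.linalg.svd(Xc, full_matrices=False)
 scores = U * s                                 # equivalently Xc @ Wt.T
 print(np.allclose(np.abs(V), np.abs(Wt.T)))    # True up to sign (assuming distinct eigenvalues)
 print(np.allclose(scores @ Wt, Xc))            # True: X is recovered as Scores * Loadings'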
[edit] Merge POD and PCA
These seem to be just different terms used in different circles/applications for the same thing. Jmath666 01:53, 16 March 2007 (UTC)
- Agree. Shyamal 08:48, 16 March 2007 (UTC)
[edit] Request information on how to choose how many components to retain
Could someone include information on the proper method for choosing the number of components to retain? I've done some searching and haven't found any 'rules'. Part of my interest is that in MBH98 they retain only the first PC, but they apparently did the centering incorrectly. If it is correctly centered, then a similar result is achieved if the first 4 PCs are included. http://en.wikipedia.org/wiki/Hockey_stick_controversy It might be useful for individuals familiar with PCA to add some details to the above link.
LetterRip 10:49, 22 March 2007 (UTC)
- Of course after I post the question I find a good reference :)
"Component retention in principal component analysis with application to cDNA microarray data"
"Many methods, both heuristic and statistically based, have been proposed to determine the number k, that is, the number of "meaningful" components. Some methods can be easily computed while others are computationally intensive. Methods include (among others): the broken stick model, the Kaiser-Guttman test, Log-Eigenvalue (LEV) diagram, Velicer's Partial Correlation Procedure, Cattell's SCREE test, cross-validation, bootstrapping techniques, cumulative percentage of total of variance, and Bartlett's test for equality of eigenvalues. For a description of these and other methods see [[7], Section 2.8] and [[9], Section 6.1]. For convenience, a brief overview of the techniques considered in this paper is given in the appendices.
Most techniques either suffer from an inherent subjectivity or have a tendency to under estimate or over estimate the true dimension of the data [20]. Ferré [21] concludes that there is no ideal solution to the problem of dimensionality in a PCA, while Jolliffe [9] notes "... it remains true that attempts to construct rules having more sound statistical foundations seem, at present, to offer little advantage over simpler rules in most circumstances." A comparison of the accuracy of certain methods based on real and simulated data can be found in [20-24]."
http://www.biology-direct.com/content/2/1/2
LetterRip 11:11, 22 March 2007 (UTC)
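Two of the simpler rules mentioned in the quoted passage, sketched in numpy for concreteness (the function names are mine, and these are illustrative implementations rather than the paper's):
 import numpy as np
 def kaiser_guttman(X):
     """Number of eigenvalues of the correlation matrix exceeding 1."""
     R = np.corrcoef(X, rowvar=False)           # correlation matrix of the column variables
     return int(np.sum(np.linalg.eigvalsh(R) > 1.0))
 def broken_stick(eigenvalues):
     """Count leading components whose variance share beats the broken-stick expectation."""
     vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
     p = len(vals)
     share = vals / vals.sum()
     expected = np.array([np.sum(1.0 / np.arange(k, p + 1)) / p for k in range(1, p + 1)])
     keep = 0
     for s_k, e_k in zip(share, expected):
         if s_k > e_k:
             keep += 1
         else:
             break
     return keep
 X = np.random.default_rng(5).normal(size=(300, 8))        # pure noise, for illustration
 print(kaiser_guttman(X))
 print(broken_stick(np.linalg.eigvalsh(np.cov(X, rowvar=False))))
Both are only heuristics, as the quoted passage notes.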