Talk:Data clustering
Sabotage
This page appears to have been deliberately vandalised.
Elbow criterion section
Someone should edit out the blatant advertising for the Excel plugin product. It would also be nice to have some additional info on picking the number of clusters.
- Sorry, I just had Excel at hand when making the graph. This is not an advertisement, and this is not real data, just a crappy hand-made Excel graph to help visualize the EC heuristic. If you look through the archives, you will see I also made an extremely ugly representation for the hierarchical clustering, which was thankfully replaced.
- As for how to tweak clustering, either you have a good idea of how many clusters you want (number criterion), or a good idea of how much total variance (or another performance metric) you want to explain, or all you are left with is heuristics. That's from a computer science POV.
- Looking in another direction, for example statistics, there are ways to compare models with different numbers of parameters, like the Akaike information criterion and the methods linked from there. It will again help you make the tradeoff between having too many clusters and having too low performance because of too few clusters. I'm sorry, I don't have a relevant article nor the time at the moment to find one. Bryan
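For anyone who wants to try the elbow criterion on their own data, here is a minimal sketch (my own illustration, assuming scikit-learn and synthetic data, not anything from the article): plot the fraction of variance explained against the number of clusters and look for the point where the curve bends.

    # Minimal sketch of the elbow heuristic: plot explained variance against the
    # number of clusters and look for the point where adding clusters stops paying off.
    # Assumes scikit-learn, NumPy and matplotlib; the data here is synthetic.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    total_ss = ((X - X.mean(axis=0)) ** 2).sum()   # total sum of squares
    ks = range(1, 11)
    explained = []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # inertia_ is the within-cluster sum of squares; the remainder is "explained"
        explained.append(1.0 - km.inertia_ / total_ss)

    plt.plot(list(ks), explained, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("fraction of variance explained")
    plt.show()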
V-means clustering
A Google search for "V-means clustering" only returns this Wikipedia article. Can someone provide a citation for this?
For future reference, this is the V-means paragraph that was removed:
V-means clustering
V-means clustering utilizes cluster analysis and nonparametric statistical tests to key researchers into segments of data that may contain distinct homogeneous sub-sets. The methodology embraced by V-means clustering circumvents many of the problems that traditionally beleaguer standard techniques for categorizing data. First, instead of relying on analyst predictions for the number of distinct sub-sets (k-means clustering), V-means clustering generates a Pareto-optimal number of sub-sets. V-means clustering is calibrated to a user-defined confidence level p, whereby the algorithm divides the data and then recombines the resulting groups until the probability that any given group belongs to the same distribution as either of its neighbors is less than p.
Second, V-means clustering makes use of repeated iterations of the nonparametric Kolmogorov-Smirnov test. Standard methods of dividing data into its constituent parts are often entangled in definitions of distances (distance measure clustering) or in assumptions about the normality of the data (expectation maximization clustering), but nonparametric analysis draws inference from the distribution functions of sets.
Third, the method is conceptually simple. Some methods combine multiple techniques in sequence in order to produce more robust results. From a practical standpoint this muddles the meaning of the results and frequently leads to conclusions typical of “data dredging.”
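For what it is worth, the Kolmogorov-Smirnov test mentioned in that paragraph is a standard nonparametric test, so the basic building block is easy to try regardless of whether V-means itself is citable. A minimal sketch of the "split, then test whether two neighbouring groups should be recombined" idea (my own illustration with SciPy, not the uncited V-means procedure):

    # Two-sample Kolmogorov-Smirnov test deciding whether two neighbouring groups
    # look like draws from the same distribution. This only illustrates the building
    # block mentioned above, not the (uncited) V-means procedure itself.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    group_a = rng.normal(0.0, 1.0, size=200)
    group_b = rng.normal(0.8, 1.0, size=200)

    stat, p_value = ks_2samp(group_a, group_b)
    alpha = 0.05   # user-defined confidence level (the "p" of the removed text)
    if p_value < alpha:
        print(f"keep the groups separate (p = {p_value:.3g})")
    else:
        print(f"recombine the groups (p = {p_value:.3g})")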
Fuzzy c-means clarification
I believe there is a typo at "typological analysis"; it should be "topological".
The explanation of the fuzzy c-means algorithm seems quite difficult to follow: the order of the bullet points is correct, but it is misleading as to which part is to be repeated and when (a small sketch follows the quoted steps below).
"The fuzzy c-means algorithm is greatly similar to the k-means algorithm:
- Choose a number of clusters
- Assign randomly to each point coefficients for being in the clusters
- Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold):
- Compute the centroid for each cluster, using the formula above
- For each point, compute its coefficients of being in the clusters, using the formula above"
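To make the order of operations concrete, here is a minimal NumPy sketch of that loop as I read the quoted steps (the fuzzifier m = 2 and all names are my own choices, not something defined in the article):

    # Minimal sketch of the fuzzy c-means loop described in the quoted steps:
    # initialise random coefficients, then alternate the centroid and coefficient
    # updates until the coefficients change by no more than eps between iterations.
    import numpy as np

    def fuzzy_c_means(X, c, m=2.0, eps=1e-4, max_iter=300, seed=0):
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        # step 2: random coefficients, one row per point, normalised to sum to 1
        U = rng.random((n, c))
        U /= U.sum(axis=1, keepdims=True)
        for _ in range(max_iter):
            # step 3a: centroids are membership-weighted means of the points
            W = U ** m
            centroids = (W.T @ X) / W.sum(axis=0)[:, None]
            # step 3b: recompute each point's coefficients from its distances to the centroids
            dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            dist = np.fmax(dist, 1e-12)          # avoid division by zero
            inv = dist ** (-2.0 / (m - 1.0))
            U_new = inv / inv.sum(axis=1, keepdims=True)
            if np.abs(U_new - U).max() < eps:    # convergence test on the coefficients
                return centroids, U_new
            U = U_new
        return centroids, U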
Also, aren't c-means and k-means just different names for the same thing? In which case, can they be changed to be consistent throughout?
The c-means name relates only to the fuzzy logic clustering algorithm. You could say that k-means is the convergence of c-means clustering with ordinary logic, rather than fuzzy logic.
On-line versus off-line clustering
Beyond the division of clustering methodologies into hierarchical/partitional and agglomerative/divisive, it is possible to differentiate between:
- Arrival of data points: on-line/off-line
- Type of data: stationary/non-stationary
Additionally, it may be helpful to discuss some of the difficulties in clustering data, in particular choosing the correct number of centroids and limits on memory or processing time, and techniques for addressing them, such as measuring phase transitions (Rose, Gurewitz & Fox).
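On the on-line side, the essential difference is that points arrive one at a time and each point only nudges its nearest centroid, so no stored data set is needed. A minimal sketch of such an update step (names and details are mine, in the spirit of MacQueen's sequential k-means, not taken from the article):

    # One step of an on-line (streaming) k-means-style update: the new point moves
    # its nearest centroid towards itself by a decreasing amount (a running mean).
    import numpy as np

    def online_kmeans_update(centroids, counts, x):
        """Fold one new point x into the current centroids, in place."""
        j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))  # nearest centroid
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]             # running-mean update
        return j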
- Another division would be Extensional vs. Conceptual Clustering. --Beau 10:40, 28 June 2006 (UTC)
Missing reference
I find this in the article:
- This is the basic structure of the algorithm (J. MacQueen, 1967):
But when I looked at the bibliography, it was not there. If anyone has the information, could they add it? Michael Hardy 18:36, 21 November 2005 (UTC)
Impossibility Theorem
The clustering impossibility theorem should be mentioned. It is similar to Arrow's impossibility theorem, except for clustering. I don't know who created it, though. Briefly, it states that any clustering algorithm should have these 3 properties:
- Richness - "all labelings are possible"
- Scale invariance
- Consistency - "if you increase inter-cluster distance and decrease intra-cluster distance, cluster assignments should not change"
No clustering algorithm can have all 3.
-- BAxelrod 03:53, 16 December 2005 (UTC)
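For reference, a rough formalisation of the three conditions as they appear in Kleinberg's paper (here f is a clustering function taking a distance function d on a finite set S, |S| >= 2, to a partition f(d) of S; this is my paraphrase):

    % Rough formalisation, following Kleinberg (2002).
    \text{Scale invariance: } f(\alpha\, d) = f(d) \quad \text{for every } \alpha > 0.
    \text{Richness: every partition of } S \text{ arises as } f(d) \text{ for some distance } d.
    \text{Consistency: if } d'(i,j) \le d(i,j) \text{ whenever } i, j \text{ share a cluster of } f(d),
    \text{ and } d'(i,j) \ge d(i,j) \text{ otherwise, then } f(d') = f(d).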
It is a good thing to have, and I guess one should mostly reference: Jon Kleinberg, "An impossibility theorem for clustering", Advances in Neural Information Processing Systems, 2002.
However, if there is no other source, then I'd mention that there is a little problem with this theorem as it is presented in that article.
First, it deals with graphs G(V,E) having |V| >= 2 and distances d(i,j) = 0 iff i == j, for i, j in V. Thus, take richness and scale invariance (which means that a graph with some fixed weights has the same clustering if all the weights are multiplied by some positive constant), take a graph with |V| = 2, and boom - there you go. For each clustering function we get either scale invariance or richness: if there is richness, then scale invariance does not hold, and the other way round. Sweet, isn't it? Or am I wrong somewhere?
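Spelling out that two-point observation as I understand it (my reading, so corrections welcome):

    % With S = {1, 2}, a distance function is a single number t = d(1,2) > 0.
    \text{Scale invariance: } f(t) = f(\alpha t) \text{ for all } \alpha > 0
      \;\Rightarrow\; f \text{ returns the same partition for every } t > 0.
    \text{Richness: both } \{\{1,2\}\} \text{ and } \{\{1\},\{2\}\} \text{ must occur as } f(t) \text{ for some } t.
    % Hence on a two-point set no single f can be both rich and scale-invariant,
    % which seems to be the clash being pointed out.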
Could you please explain something about the ISODATA algorithm for data clustering?
The last external link on this page has an example on ISODATA clustering. I will try to do a digest when I have time, but feel free to beat me to it. Bryan
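Until that digest gets written, here is a very rough sketch of the usual ISODATA skeleton as I understand it: a k-means-style assignment step, plus splitting of clusters with large spread and merging of centroids that end up close together. All thresholds and names below are placeholders, and real ISODATA implementations differ in the details:

    # Very rough ISODATA-style skeleton: k-means assignment plus split/merge steps.
    # Thresholds and names are placeholders; real implementations differ in detail.
    import numpy as np
    from sklearn.cluster import KMeans

    def isodata_sketch(X, k_init=5, max_stddev=1.5, min_dist=1.0, n_rounds=5):
        rng = np.random.default_rng(0)
        centroids = X[rng.choice(len(X), k_init, replace=False)]
        for _ in range(n_rounds):
            km = KMeans(n_clusters=len(centroids), init=centroids, n_init=1).fit(X)
            centroids, labels = km.cluster_centers_, km.labels_
            # split: a cluster whose spread exceeds max_stddev contributes two centroids
            new_centroids = []
            for j, c in enumerate(centroids):
                pts = X[labels == j]
                spread = pts.std(axis=0).max() if len(pts) > 1 else 0.0
                if spread > max_stddev:
                    new_centroids += [c + spread / 2, c - spread / 2]
                else:
                    new_centroids.append(c)
            # merge: centroid pairs closer than min_dist are averaged into one
            merged, used = [], set()
            for a in range(len(new_centroids)):
                if a in used:
                    continue
                c = new_centroids[a]
                for b in range(a + 1, len(new_centroids)):
                    if b not in used and np.linalg.norm(c - new_centroids[b]) < min_dist:
                        c = (c + new_centroids[b]) / 2
                        used.add(b)
                merged.append(c)
            centroids = np.array(merged)
        return centroids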
Relation to NLP
It seems a large amount of the effort in text mining related to text clustering is left out of this article, yet this seems to be the most appropriate place for it. Josh Froelich 20:16, 9 January 2007 (UTC)
No strict definition for the problem itself
There are lots of details about the different methods and metrics used to solve the problem, but the problem itself is defined too loosely.
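Agreed. One common (if only partial) way the problem is made precise, which could anchor the article, is as an optimisation problem, for example the k-means objective (this particular formalisation is my suggestion, not something from the article):

    % Given points x_1, ..., x_n and a number of clusters k, find a partition
    % C_1, ..., C_k minimising the within-cluster sum of squares:
    \min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
    \qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i .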
All software links were removed
Here they are; should we put them back?
Software implementations
Free
- The flexclust package for R
- COMPACT - Comparative Package for Clustering Assessment (in Matlab)
- YALE (Yet Another Learning Environment): freely available open-source software for data pre-processing, knowledge discovery, data mining, machine learning, visualization, etc., also including a plugin for clustering, fully integrating Weka, easily extensible, and featuring a graphical user interface as well as an XML-based scripting language for data mining
- mixmod: Model-Based Cluster and Discriminant Analysis. Code in C++, with interfaces to Matlab and Scilab
- LingPipe Clustering Tutorial Tutorial for doing complete- and single-link clustering using LingPipe, a Java text data mining package distributed with source.
- Weka : Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
- Tanagra: free data mining software including several clustering algorithms such as K-MEANS, SOM, Clustering Tree, HAC, and more.
- Cluster: open-source clustering software. The routines are available in the form of a C clustering library, an extension module for Python, and a module for Perl.
- python-cluster: a pure Python implementation
Non-free
- Clustan
- Peltarion Synapse (using self-organizing maps)[1]
- Eudaptics Viscovery: Data Mining Suite for Visual Cluster Analysis
- I removed these, about a month ago. Wikipedia's external link guidelines discourage large link directories like this. None of the links appear to meet the guidelines, each is just an instance of data clustering software with nothing to indicate it is especially notable to the topic. Normally I would have replaced them with a DMOZ link, but there doesn't appear to be a category for this. -SpuriousQ (talk) 10:14, 22 March 2007 (UTC)
- Actually it did take some time to build up such a list of software. Also, since this is an article about algorithms, any software implementing them is relevant to the topic (if you have an interest in the algo, it might very well be because you need a working implementation, or a reference implementation) —The preceding unsigned comment was added by 86.53.54.179 (talk • contribs).
- I understand that, but that list was just an invitation for spam and links to data clustering software someone wrote one day. WP:EL is clear that links should be kept to a minimum. I would probably have no problem with links to clearly notable implementations; for example, one that was the subject of a peer-reviewed published paper or one created by noted researchers in the field. -SpuriousQ (talk) 15:12, 28 March 2007 (UTC)