Bayesian filter
From Wikipedia, the free encyclopedia
Bayesian filtering is the process of using Bayesian statistical methods to classify documents into several categories.
Bayesian filtering came into common use after it was described in the paper "A Plan for Spam" by Paul Graham[1], and it has become a well-known mechanism for distinguishing spam from wanted email. Many modern email programs, such as Mozilla Thunderbird, use Bayesian spam filtering.
Bayesian filters rely on the fact that particular words have different likelihoods of occurring across different categories. For instance, most email users will seldom see the word "Viagra" in legitimate email, but will encounter it frequently in spam email. To 'train' the filter, the user must manually indicate into which category a particular document belongs, and the filter will then assign a probability to each word in the email.
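The training step described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular mail program: the function names, the whitespace tokenizer, the add-one smoothing, and the assumption that spam and legitimate mail are equally likely beforehand are all choices made for this example.

```python
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase the text and split on whitespace.
    return text.lower().split()


def train(labeled_docs):
    """Count, per category, how many documents contain each word.

    labeled_docs is an iterable of (text, is_spam) pairs supplied by the user.
    """
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for text, is_spam in labeled_docs:
        words = set(tokenize(text))  # count each word once per document
        if is_spam:
            spam_counts.update(words)
            n_spam += 1
        else:
            ham_counts.update(words)
            n_ham += 1
    return spam_counts, ham_counts, n_spam, n_ham


def word_spam_probability(word, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | word), assuming equal prior odds for both categories."""
    # Add-one smoothing keeps unseen words away from probabilities of 0 or 1.
    p_word_given_spam = (spam_counts[word] + 1) / (n_spam + 2)
    p_word_given_ham = (ham_counts[word] + 1) / (n_ham + 2)
    return p_word_given_spam / (p_word_given_spam + p_word_given_ham)


# Hypothetical training data, invented for this example.
model = train([
    ("cheap viagra now", True),
    ("viagra refinance offer", True),
    ("lunch with alice tomorrow", False),
    ("alice sent the report", False),
])
viagra_probability = word_spam_probability("viagra", *model)
alice_probability = word_spam_probability("alice", *model)
```

With this toy data, a word seen only in spam receives a probability above 0.5, a word seen only in legitimate mail receives one below 0.5, and a word never seen before stays at exactly 0.5.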
This probability indicates the likelihood that, in the absence of any other evidence, the document belongs in a particular category. For instance, most spam filter users will end up assigning a very high spam probability to the words "Viagra" and "Refinance", but a very high not-spam probability to words they only see in legitimate emails, such as the names of friends and family members. When all of the evidence is taken together and a final spam probability is computed, the filter will mark the email as spam if it is considered extremely likely to be such.
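One common way to take all of the evidence together is to treat the words as independent and add up their log-odds. The sketch below assumes that independence, assumes equal prior odds for spam and legitimate mail, and uses a hypothetical threshold of 0.9 for "extremely likely"; the function names are illustrative, and each input probability must lie strictly between 0 and 1.

```python
import math


def combine(word_probabilities):
    """Combine per-word spam probabilities into one spam probability.

    Assumes the words are independent and that each probability is
    strictly between 0 and 1.
    """
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in word_probabilities)
    # Convert log-odds back to a probability without overflowing exp().
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


def is_spam(word_probabilities, threshold=0.9):
    # The threshold is an assumption; a real filter would tune it.
    return combine(word_probabilities) > threshold


strong_evidence = is_spam([0.99, 0.95])   # two words that are both very "spammy"
mixed_evidence = is_spam([0.75, 0.25])    # one spammy word, one legitimate word
```

Evidence pointing in opposite directions cancels out: a word with probability 0.75 and a word with probability 0.25 combine to exactly 0.5, so the message is not marked as spam.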
The advantage of Bayesian spam filtering is that it can be trained on a user-by-user basis. The spam a user receives is often related to that user's own activity, and therefore clusters statistically; placing a personal ad, for instance, may increase the likelihood of receiving personal-ad-related spam. The legitimate email a user receives also tends to cluster statistically, because a person's coworkers, friends, and family members often discuss related subjects and therefore use similar words. Because these two sets of words differ from user to user, a filter trained by an individual user can potentially offer greater filtering accuracy.
While Bayesian filtering is most often used to identify spam, the technique can potentially be applied to classify any sort of document.
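To illustrate classification into more than two categories, here is a minimal naive Bayes sketch. The class name, the whitespace tokenizer, the add-one smoothing, and the category labels are all assumptions made for this example; it is not taken from any existing program.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """A small multi-category word-count classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> documents seen
        self.vocabulary = set()

    def train(self, text, category):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most probable category, or None if nothing was trained."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Start from the log prior, then add one log likelihood per word.
            score = math.log(n_docs / total_docs)
            for word in words:
                # Add-one smoothing over the whole vocabulary.
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocabulary))
                )
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical training documents and categories.
classifier = NaiveBayesClassifier()
classifier.train("quarterly budget report meeting", "work")
classifier.train("budget meeting agenda", "work")
classifier.train("football match tickets weekend", "sports")
classifier.train("weekend football training", "sports")
classifier.train("holiday recipe cake dinner", "family")
```

Calling `classifier.classify("budget report")` then picks whichever trained category makes those words most probable, which need not have anything to do with spam.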
Many spam filters are available. One of the most popular is POPFile, which is available on sourceforge.net. The software is trained to distinguish spam from legitimate mail and to classify messages accordingly.