Wikipedia:Version 1.0 Editorial Team/MartinBotII

Shortcut: WP:MBOT

1 The original proposal
2 Initial test
3 Proposed full-scale test
- 3.1 Option A
- 3.2 Option B
4 Possible problems
- 4.1 Importance

User:MartinBotII is being used to sift through articles tagged by WikiProjects in order to find articles suitable for inclusion in offline releases.

The original proposal

We are looking ahead to 2007 releases, and we'd like to generate lists of articles pulled from our existing WikiProject lists. This is something we've talked about for some time; you can see an example of a recent suggestion/discussion here. I'd ask Mathbot, but Mathbot is already pretty tied up with the existing work (over a third of a million articles right now, and growing), and I don't want Oleg to have to add another level of complexity to it. In effect I'm asking for a bot to trawl through the lists Mathbot generates and produce its own, much shorter, lists. I envisage something like this:

Go through a specific set of WikiProject article worklists (say, from a set of Science project lists), and pull out all articles of a certain minimum quality standard and importance standard. This might be done according to some formula, or it may just be "All articles ranked B or better, and mid-importance or better"
(Optional) Check for any POV tags or other red flags on the article pages.
Generate a worklist, simple alphabetical list, log and statistics from the above data.
Repeat the run once a week.

That way we might be able to quickly generate a list of all the natural science articles suitable for publication. I imagine the code for this could be quite similar to Mathbot's, with a few changes. I would expect it to work through WikiProject Set A, then Set B, then Set C, etc., until it completes all of our sets of WikiProjects.
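The selection step outlined above could be sketched roughly as follows. This is only an illustration, not the bot's actual code: the `Article` record, the `select_articles` function, the `red_flags` field and the sample worklist are all hypothetical, and the cut-off used is the simple "B or better, mid-importance or better" rule.

```python
from dataclasses import dataclass, field

# Grades listed from best to worst; a lower index means a better grade.
QUALITY_ORDER = ["FA", "A", "GA", "B", "Start", "Stub"]
IMPORTANCE_ORDER = ["Top", "High", "Mid", "Low"]


@dataclass
class Article:
    title: str
    quality: str
    importance: str
    red_flags: list = field(default_factory=list)  # hypothetical, e.g. ["POV"]


def select_articles(worklist, min_quality="B", min_importance="Mid",
                    skip_flagged=True):
    """Return an alphabetical list of titles that meet both minimum standards."""
    q_cut = QUALITY_ORDER.index(min_quality)
    i_cut = IMPORTANCE_ORDER.index(min_importance)
    selected = []
    for art in worklist:
        if QUALITY_ORDER.index(art.quality) > q_cut:
            continue  # quality below the minimum
        if IMPORTANCE_ORDER.index(art.importance) > i_cut:
            continue  # importance below the minimum
        if skip_flagged and art.red_flags:
            continue  # optional check for POV tags and similar
        selected.append(art.title)
    return sorted(selected)


# Made-up worklist for illustration only.
sample = [
    Article("Oxygen", "FA", "Top"),
    Article("Benzene", "B", "Mid"),
    Article("Obscure salt", "B", "Low"),
    Article("Disputed topic", "A", "High", ["POV"]),
    Article("Short stub", "Stub", "Top"),
]
print(select_articles(sample))
```

A weekly run would simply call `select_articles` once per WikiProject worklist and write out the resulting lists and statistics.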

I think we would generate WikiProject sets manually, based on subject category (as listed at {{V0.5}}) and also how high up the tree the project comes. Thus we might have Arts at the top of the tree, then Music, then Music genres, then Electronic Music, then the KLF. I don't think we'd apply a simplistic formula for ranking these; we would take each one on a case-by-case basis, because we'd also have to take into account how that project assesses quality and importance, to compensate for project biases. In other words, we don't want project A getting all its articles included on our DVD simply because it tagged everything as Top-importance and B or better.

Initial test

A pilot trial was conducted using articles from chemistry, physics, medicine and mathematics, and this generated the results accessible here. This pilot trial used the simple algorithm described here:

Quality:

FA-Class is 7.5
A-Class is 6.5
GA-Class is 6
B-Class is 4.5
Start-Class is 4
Stub-Class is 2

Importance:

Articles that are needed for completeness will have their importance rating doubled
Top-importance is 7.5
High-importance is 6
Mid-importance is 4
Low-importance is 2.5

The rating of an article is its quality rating times its importance rating. Articles with a rating of 20 or more will automatically be included in the release version. (This includes Top-importance articles that are at least Start-Class, High-importance articles that are at least B-Class, and Mid-importance articles that are at least GA-Class; no Low-importance articles qualify.)

Score (quality rating x importance rating):

Quality         Top (7.5)   High (6)   Mid (4)   Low (2.5)
FA (7.5)        56.25       45         30        18.75
A (6.5)         48.75       39         26        16.25
GA (6)          45          36         24        15
B (4.5)         33.75       27         18        11.25
Start (4)       30          24         16        10
Stub (2)        15          12         8         5

NOTE: the minimum rating can be increased to get better articles (at the expense of quantity) or decreased to get more articles (at the expense of quality).
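As a rough sketch, the pilot algorithm amounts to a lookup, a multiplication and a threshold test. The function and parameter names below are made up for illustration; the weights are those used in the score table (GA is taken as 6), and the cut-off of 20 is exposed as a parameter so that it can be raised or lowered as described in the note above.

```python
# Weights as used in the score table.
QUALITY = {"FA": 7.5, "A": 6.5, "GA": 6, "B": 4.5, "Start": 4, "Stub": 2}
IMPORTANCE = {"Top": 7.5, "High": 6, "Mid": 4, "Low": 2.5}


def rating(quality, importance, needed_for_completeness=False):
    """Quality weight times importance weight.

    The importance weight is doubled for articles that are needed
    for completeness.
    """
    imp = IMPORTANCE[importance]
    if needed_for_completeness:
        imp *= 2
    return QUALITY[quality] * imp


def included(quality, importance, threshold=20, needed_for_completeness=False):
    """True if the article's rating reaches the minimum rating."""
    return rating(quality, importance, needed_for_completeness) >= threshold


for q, i in [("FA", "Top"), ("GA", "Mid"), ("B", "Mid"), ("FA", "Low")]:
    print(q, i, rating(q, i), included(q, i))
```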

The test was considered to be a great success - it generated a very viable selection of articles.

Proposed full-scale test

One major problem remains before the bot can be used to generate large lists: we need to be able to reliably judge importance across a wide spectrum of articles. We will need to compensate for (a) project importance level (e.g., USA vs. Texas vs. Dallas) and (b) assessment practices at the project concerned (e.g., whether it has 1 or 100 "Top" articles). The latter isn't a problem now, but a few projects may try to cram lots of their articles into our releases if there isn't a check built in from the start.

Two (more refined) algorithms have been proposed:

Option A

See full discussion here. This approach takes the multiplication scale outlined above and refines it, to take account of importance of the individual WikiProject.

2.2, world in general (if there is one)
2, continents
1.9, regional blocs like the European Union
1.8, major countries (top fifth percentile GDP)
1.8, science, history, arts, entertainment, etc. (general projects)
1.6, moderately important countries (40th-80th percentile GDP or bigger than and including Iraq)
1.3, minor countries (everything else larger than and including Andorra)
1, global cities
0.9, each area of science, history, sports, etc. (major, e.g. Chemistry or Football) (definition of major? >=90,000,000 Google hits?)
0.8, each area of everyday life (major, e.g. Train or Trees, singular, >150 million Google hits)
0.7, each area of science, history, sports, etc. (minor, e.g. Developmental psychology, everything not major)
0.7, tiny countries like Monaco and major cities (1,000,000+ population or a global city, debatable)
0.5, TV shows (major, >=12,500,000 Google hits, after searching for MythBusters, Oprah, and CNN)
0.4, minor cities (not major)
0.2, TV shows (minor, i.e. not major)

The importance of a WikiProject's articles would be multiplied by its importance rating to get the final rating. These numbers could be tweaked a bit, though. Feel free to edit it without posting a new message. Importance in this case is defined by Google hits (i.e. roughly how many people know about it). There is now a page for rating the importance of projects at Wikipedia:Version 1.0 Editorial Team/Work via Wikiprojects/Importance because we will need to rate them sometime.
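One possible reading of Option A is sketched below: the article's importance weight is scaled by the project's multiplier before being multiplied by the quality weight. This interpretation, the function name and the example calls are assumptions made for illustration; the quality and importance weights are those of the pilot score table, and the multipliers are taken from the list above.

```python
# Pilot weights, as used in the score table.
QUALITY = {"FA": 7.5, "A": 6.5, "GA": 6, "B": 4.5, "Start": 4, "Stub": 2}
IMPORTANCE = {"Top": 7.5, "High": 6, "Mid": 4, "Low": 2.5}


def option_a_rating(quality, importance, project_multiplier):
    """Quality weight times project-adjusted importance weight (assumed reading)."""
    adjusted_importance = IMPORTANCE[importance] * project_multiplier
    return QUALITY[quality] * adjusted_importance


# The same FA-Class, Top-importance article under three kinds of project.
print(option_a_rating("FA", "Top", 1.8))  # general science project
print(option_a_rating("FA", "Top", 0.9))  # major area of science, e.g. Chemistry
print(option_a_rating("FA", "Top", 0.2))  # minor TV show
```

Because most multipliers are below 1, the minimum rating used in the pilot would presumably need to be retuned if this scheme were adopted.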

Option B

Each article is given a score, based on an additive scale. This is designed to allow for fine tuning, while keeping the numbers as integers rather than fractions. The aim is that for Version 0.7, any article with a score equal to or over 1000 would be included. The algorithm should be applicable to any article in Wikipedia once fully established. The system weights importance more than quality.

The proposal would be to run the bot first without the "correction factors" to see how well it works, then phase in correction factors as needed (if needed). We can also tweak the numbers as needed, based on the initial results.

The formula would be

Score = Q' + I' + P'

where Q' = corrected article quality, I' = corrected article importance, and P' = corrected project ranking.

The more detailed version of this is

Score = (Q + QC) + (I + IC) + (P + PC)

where Q, I and P represent the uncorrected quality, importance and project ranking, and QC, IC and PC represent the corrections to those ranks. The corrections are calculated separately for each WikiProject.

Quality of article: Q' = Q + QC

In addition to a basic "raw quality score" (Q) of

FA 300
GA/A 250
B 200
Start 150
Stub 80

there would also be a correction factor QC for any given project:

If a% of tagged articles from that project are A (not GA) and b% of articles are B,
then the quality score is reduced by (10a+b).
Thus QC is negative and is used to correct for any "grade inflation."

If a project really does have a lot of very good articles,
this negative correction factor will be more than compensated for
by the positive factors (PC) added to the project rank.
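The quality correction can be illustrated with a short sketch. The function names are hypothetical, and it is assumed here that a and b are given as percentages on a 0-100 scale.

```python
# Raw quality scores (Q) from the list above; A and GA share a score.
RAW_QUALITY = {"FA": 300, "A": 250, "GA": 250, "B": 200, "Start": 150, "Stub": 80}


def quality_correction(pct_a, pct_b):
    """QC = -(10a + b), with a and b as percentages (0-100, assumed)."""
    return -(10 * pct_a + pct_b)


def corrected_quality(grade, pct_a, pct_b):
    """Q' = Q + QC for an article of the given grade in the given project."""
    return RAW_QUALITY[grade] + quality_correction(pct_a, pct_b)


# Made-up project: 2% of tagged articles are A-Class, 20% are B-Class.
print(quality_correction(2, 20))      # QC for the whole project
print(corrected_quality("B", 2, 20))  # Q' for one of its B-Class articles
```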

Importance

This might be based upon

The importance of the article topic (I') as judged by a relevant WikiProject, in conjunction with the importance of that WikiProject's general subject area, called the project ranking (P').
The importance of the article topic as judged by other parameters, such as the number of mainspace links to that article, and perhaps looking at the importance of those links.

To keep things relatively simple at this point, only the first of these parameters will be considered; for a more detailed description of the second proposal see this outline.

Importance of article (I')

I' = I + IC

As well as a basic "raw importance score" (I):

Top 550
High 400
Mid 250
Low 100

there would also be a correction factor IC for each project:

If x% of tagged articles from that project are Top,
y% of articles are High, and z% of articles are Low,
then the importance correction factor is (-10x - y + z).
Thus IC is usually negative and is used to correct for any "grade inflation."
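The importance correction works the same way. In the sketch below the function names are hypothetical and x, y and z are again assumed to be percentages on a 0-100 scale. The second example shows why IC is only "usually" negative: a project that rates most of its articles Low gets a positive correction.

```python
# Raw importance scores (I) from the list above.
RAW_IMPORTANCE = {"Top": 550, "High": 400, "Mid": 250, "Low": 100}


def importance_correction(pct_top, pct_high, pct_low):
    """IC = -10x - y + z, with x, y, z as percentages (0-100, assumed)."""
    return -10 * pct_top - pct_high + pct_low


def corrected_importance(level, pct_top, pct_high, pct_low):
    """I' = I + IC for an article of the given importance level."""
    return RAW_IMPORTANCE[level] + importance_correction(pct_top, pct_high, pct_low)


# Made-up project that tags generously: 5% Top, 20% High, 10% Low.
print(importance_correction(5, 20, 10))
# Made-up project that tags sparingly: 1% Top, 5% High, 60% Low.
print(importance_correction(1, 5, 60))
```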

WikiProject ranking (P')

P' = P + PC

In addition to a basic "raw project rank" (P) of (detailed description to be determined)

Top-level in hierarchy (e.g., History) 400-500
High-level in hierarchy (e.g., History of Poland) 300-400
Mid-level in hierarchy (e.g., Kings and Queens of Poland) 200-300
Low-level in hierarchy (e.g., WikiProject:Mieszko I of Poland) 100-200

there could also be a correction factor PC. A project would have a positive PC added, designed to "reward" a project for being active and for writing good quality articles. This is needed in order to correct for the fact that some subject areas may have more thorough coverage or better articles than do others. For example, WP:MILHIST covers just about 2% of Wikipedia articles, but has about 10% of all the Featured Articles. The correction factor would be derived from:

Number of articles tagged
Number of FAs and GAs
Number of participants (perhaps)
Number of links to the project (excluding article talk page tags) (perhaps)
Talk page activity (perhaps)

Details of project ranking might be refined here.
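Putting the pieces together, the additive score might be computed along the following lines. This is a sketch only: the function names are invented, the raw project rank is passed in as a number because its detailed definition is still to be determined, and the three corrections default to zero to mirror the proposed first run without correction factors.

```python
RAW_QUALITY = {"FA": 300, "A": 250, "GA": 250, "B": 200, "Start": 150, "Stub": 80}
RAW_IMPORTANCE = {"Top": 550, "High": 400, "Mid": 250, "Low": 100}
INCLUSION_THRESHOLD = 1000  # proposed cut-off for Version 0.7


def option_b_score(grade, level, project_rank, qc=0, ic=0, pc=0):
    """Score = (Q + QC) + (I + IC) + (P + PC)."""
    q_corrected = RAW_QUALITY[grade] + qc
    i_corrected = RAW_IMPORTANCE[level] + ic
    p_corrected = project_rank + pc
    return q_corrected + i_corrected + p_corrected


def is_included(score, threshold=INCLUSION_THRESHOLD):
    """True if the score is equal to or over the threshold."""
    return score >= threshold


# Uncorrected first run: FA-Class, Top-importance article in a top-level project.
s1 = option_b_score("FA", "Top", 450)
# Same formula with made-up corrections phased in.
s2 = option_b_score("GA", "High", 350, qc=-40, ic=-60, pc=25)
print(s1, is_included(s1))
print(s2, is_included(s2))
```

Because the raw importance scores span a wider range than the raw quality scores, a Top-importance stub in a top-level project (80 + 550 + 400) can still reach the threshold, which reflects the stated intention of weighting importance more than quality.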

Possible problems

Importance

Importance has proved to be a thorny issue, with many projects electing to avoid the issue altogether by only tagging for quality. In the 1.0 project we have seen this first hand, as editors take offence when told that their favourite FA is not important enough to be included. This problem is likely to be even more serious when the relative importance of projects is being debated. This issue cannot be ducked, however; if we want to produce a broad selection of Wikipedia yet be selective, importance is probably the main criterion used for making that selection.

The ideal way to assess importance is (as with quality) to let the experts (the WikiProjects) do this themselves. This simplifies the problem, but each WikiProject then needs to be ranked for its importance. This is likely to be about as popular as trying to put down hard figures for the Value of life. As such, we must try to find fairly objective ways to do this - the correction factor helps with this, but better would be to find some external ranking schemes. Another problem is that most articles on Wikipedia at present don't have any assessment for importance - though this might begin to change if people see that it helps get their articles included in Version 1.0.

Another way to approach the problem is to try and rank individual articles ourselves using objective criteria, possibly using one of the two methods described here, summarised as:

Just a simple number - e.g., "346 articles link to this article".
A more complex algorithm that would factor in the importance of those 346 articles, perhaps via iteration. I imagine a first run that ignores this factor, then later runs phase it in gradually. You would have the bot read a table called (say) ImportanceOld that contains the full listing of importance numbers generated on the previous run, and use that in generating ImportanceNew; at the beginning of the next run the bot would copy ImportanceNew over ImportanceOld.
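The iterative scheme could look something like the sketch below. The blending rule is an assumption chosen for illustration, not a method taken from the linked outline: each run mixes the plain inbound-link count with the previous run's importance of the linking articles, and the `weight` parameter controls how strongly the latter is phased in. All names and the tiny link graph are made up.

```python
def next_importance(inlinks, importance_old, weight):
    """One bot run: compute ImportanceNew from ImportanceOld.

    inlinks maps each article to the list of articles linking to it.
    weight = 0 gives the simple link count; larger weights phase in
    the importance of the linking articles.
    """
    if importance_old:
        mean_old = sum(importance_old.values()) / len(importance_old)
    else:
        mean_old = 1.0
    mean_old = mean_old or 1.0  # avoid dividing by zero

    importance_new = {}
    for title, sources in inlinks.items():
        count = len(sources)
        # Each linking article counts in proportion to its previous
        # importance relative to the average article.
        weighted = sum(importance_old.get(s, mean_old) / mean_old
                       for s in sources)
        importance_new[title] = (1 - weight) * count + weight * weighted
    return importance_new


# Made-up link graph: A is linked from B and C, B is linked from A.
inlinks = {"A": ["B", "C"], "B": ["A"], "C": []}

importance_old = {}
for weight in (0.0, 0.5):  # first run ignores the factor, second phases it in
    importance_new = next_importance(inlinks, importance_old, weight)
    print(weight, importance_new)
    importance_old = importance_new  # copy ImportanceNew over ImportanceOld
```

In this toy graph the second run raises B relative to its plain link count, because its single inbound link comes from the most-linked article.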

Some other machine-based ranking methods include:

Number of hits on each article, as is used for the Wikicharts.
A count of interwiki links, i.e., how many different language Wikipedias have that article, as was used in this discussion.

It may be possible to have the bot use all of these methods, in which case the importance rating should be even more reliable.

Category: Wikipedia release version work