Wikipedia Diskussion:WikiProjekt Georeferenzierung/Wikipedia-World/news
The CSV file linked from this template is, as of 2006-12-29, encoded in a curious way. First, its original content was encoded in UTF-8; second, the resulting byte stream was decoded to Unicode as if it were Windows-1252; third, the resulting Unicode was re-encoded as UTF-8. Finally, the file was compressed in .zip format.
After unzipping, the following UNIX command-line filter, written in Python, removes these layers of encoding and outputs a valid UTF-8 encoding of the original Unicode content of the CSV file. It reads from standard input and writes to standard output. I hope this is useful.
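To make the layering concrete, here is a minimal Python 3 sketch that reproduces the three encoding steps on a single word. The function name `double_encode` and the sample string are illustrative assumptions, not taken from the tool that produced the file.

```python
def double_encode(original):
    """Reproduce the three encoding layers described above."""
    utf8_bytes = original.encode("utf-8")        # step 1: correct UTF-8
    misread = utf8_bytes.decode("windows-1252")  # step 2: wrong decoding
    return misread.encode("utf-8")               # step 3: UTF-8 again


# "M\u00fcnchen" is "Muenchen" spelled with u-umlaut (U+00FC), written as an
# escape so the example does not depend on the encoding of this page.
damaged = double_encode("M\u00fcnchen")
print(damaged)  # the two-byte umlaut has grown to four bytes
```

Each non-ASCII character roughly doubles in size: the two UTF-8 bytes of the umlaut are each mistaken for a separate character and then encoded again. Note that Python's own `windows-1252` codec refuses the five bytes that Windows-1252 leaves undefined, so this sketch only works for text that does not contain them.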
    import sys, string

    def explodify(ch):
        # These five bytes are undefined in Windows-1252, so they were
        # passed through unchanged as the code points of the same value
        if ord(ch) in [0x81, 0x8d, 0x8f, 0x90, 0x9d]:
            return chr(ord(ch))
        return chr(ord(ch.encode("windows-1252")))

    # Stuff has been multiply encoded: remove these layers
    while 1:
        text = sys.stdin.readline()
        if not text:
            break
        # Decode as UTF-8
        text = text.decode("utf-8")
        # Rip off a layer of Windows-1252 -> Unicode
        text = string.join([explodify(ch) for ch in text], "")
        # What is left is a valid UTF-8 string: which is what we want
        sys.stdout.write(text)
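The filter above is Python 2 code (it relies on `str.decode` and `string.join`). For readers on Python 3, the same idea might look roughly like the sketch below. The names `to_byte`, `repair` and `filter_stream` are hypothetical, and the treatment of the five undefined bytes is carried over from the original filter as an assumption about how the file was produced.

```python
import io

# Bytes with no Windows-1252 assignment. Assumption carried over from the
# filter above: they were passed through as code points U+0081, U+008D, ...
UNDEFINED_CP1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def to_byte(ch):
    """Map one mis-decoded character back to the single byte it came from."""
    code = ord(ch)
    if code in UNDEFINED_CP1252:
        return code
    return ch.encode("windows-1252")[0]


def repair(damaged):
    """Undo the outer UTF-8 layer, then the Windows-1252 layer."""
    text = damaged.decode("utf-8")
    return bytes(to_byte(ch) for ch in text)


def filter_stream(src, dst):
    """Repair a binary stream line by line."""
    for line in src:
        dst.write(repair(line))


# In-memory demonstration on the damaged form of "M\u00fcnchen".
out = io.BytesIO()
filter_stream(io.BytesIO(b"M\xc3\x83\xc2\xbcnchen\n"), out)
print(out.getvalue())
```

To use it as a command-line filter, one could call `filter_stream(sys.stdin.buffer, sys.stdout.buffer)` after importing `sys`. Working on the binary streams avoids any dependence on the terminal's locale.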
Note: even after running this decoder, a considerable number of HTML entity escapes (both numeric and named) and %XX urlencode()-style hex escapes remain in the data fields. These need no detective work on the file's encoding, however, and can be handled after CSV-decoding the valid UTF-8 data produced by this filter.
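As a rough illustration of that clean-up step, the Python 3 sketch below unescapes each field after CSV parsing, using only the standard library. The helper names `clean_field` and `clean_rows`, the comma delimiter, the order of the two unescaping steps and the sample row are all assumptions; the real file may need a different delimiter or order.

```python
import csv
import html
import io
from urllib.parse import unquote


def clean_field(value):
    """Resolve %XX escapes first, then HTML entities (order is an assumption)."""
    return html.unescape(unquote(value))


def clean_rows(utf8_text):
    """Parse already-repaired CSV text and unescape every field."""
    for row in csv.reader(io.StringIO(utf8_text)):
        yield [clean_field(field) for field in row]


# Made-up sample row, not taken from the real file.
sample = "K%C3%B6ln,M&uuml;nchen,Z&#252;rich\n"
rows = list(clean_rows(sample))
```

Doing this per field, after CSV parsing, means that an escaped comma or quote character cannot disturb the column boundaries.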
-- The Anome 19:16, 29 Dec. 2006 (CET)
Database structure
    Field      Type          Collation  Null  Default
    lang       varchar(10)   utf8_bin   No    -
    Titel      varchar(180)  utf8_bin   No    -
    Titel_en   varchar(90)   utf8_bin   Yes   NULL
    Titel_de   varchar(90)   utf8_bin   Yes   NULL
    Titel_es   varchar(90)   utf8_bin   Yes   NULL
    Titel_fr   varchar(90)   utf8_bin   Yes   NULL
    Titel_it   varchar(90)   utf8_bin   Yes   NULL
    Titel_ja   varchar(160)  utf8_bin   Yes   NULL
    Titel_nl   varchar(90)   utf8_bin   Yes   NULL
    Titel_pl   varchar(90)   utf8_bin   Yes   NULL
    Titel_pt   varchar(90)   utf8_bin   Yes   NULL
    Titel_ru   varchar(200)  utf8_bin   Yes   NULL
    Titel_sv   varchar(90)   utf8_bin   Yes   NULL
    lat        double        -          No    0
    lon        double        -          No    0
    type       varchar(10)   utf8_bin   No    -
    pop        double        -          No    0
    Height     double        -          No    0
    Country    varchar(10)   utf8_bin   No    -
    Subregion  varchar(10)   utf8_bin   No    -
    Scale      varchar(10)   utf8_bin   No    -
    psize      double        -          No    -
    style      varchar(12)   utf8_bin   No    -
    t          varchar(10)   utf8_bin   No    -
    image      varchar(160)  utf8_bin   No    -