Geographically densest Wikipedia coverage

Wikipedia articles can be tagged with latitude/longitude coordinates. I was recently curious to know: which areas have the most coverage? It's important not to read too much into the answer, because the density of coordinates reflects a mixture of factors: how active the different Wikipedia language projects are, how diligently each of them geo-tags its articles, which regions have had lots of short articles mechanically imported (e.g. on small towns or metro stations), and finally the actual landmark density (e.g. dense urban cores versus sprawling suburbs). Nonetheless, it might be interesting to know.

So, here are the most densely Wikipedia-article-populated parts of the world, at several scales.

Diameter of 1 km (0.6 mi)

A diameter of 1 km finds very compact city centers, or particularly dense areas of larger cities.

Florence, Italy, city center (405 points)
Stockholm, Sweden, old town (388 points)
Vienna, Austria, city center (238 points)
London, UK, West End vicinity (206 points)
Stockholm, Sweden, Hötorget, a bit north of the old town (194 points)
Oxford, UK, city center (171 points)
Kraków, Poland, city center (163 points)
Rome, Italy, Palatine hill / forum (158 points)
Venice, Italy, San Marco district (158 points)
London, UK, the City of London (154 points)

London and Stockholm each show up here twice. Stockholm appears to have very dense coverage in its downtown area, while London is quite large and densely covered in many parts of the city. Italy also has three compact historical city centers in the top 10. All in all, ten out of ten are in Europe.

Honorable mentions for the top North American and South American locales:

New York City, USA, midtown Manhattan (48th & 7th) (130 points)
Buenos Aires, Argentina, city center (Plaza de Mayo) (96 points)

Diameter of 10 km (6 mi)

Moving up to a 10-km diameter gives us a small city's metropolitan area, or a large city's urban core.

London, UK (2645 points)
Paris, France (2313 points)
Stockholm, Sweden (2217 points)
Moscow, Russia (1796 points)
New York City, USA (1695 points)
Vienna, Austria (1540 points)
Berlin, Germany (1505 points)
Oslo, Norway (1300 points)
Florence, Italy (1259 points)
Brussels, Belgium (1150 points)

Now major cities with a strong urban core but no single 1-km city center come to the top of the list: Paris, Moscow, New York, etc. Again these are mostly European cities, with NYC as the single exception.

Honorable mentions for the top locales on continents other than Europe and North America:

Tokyo, Japan (882 points)
Buenos Aires, Argentina (527 points)
Jerusalem, Israel (516 points)

Diameter of 100 km (62 mi)

Finally, a 100-km diameter is the size of a large metropolitan area, or an urban agglomeration.

London and northwest environs, UK (10541 points)
Andorra and rural Catalonia, Spain (9067 points)
Cologne-Düsseldorf-Dortmund (Rhine-Ruhr area), Germany (8782 points)
New York City and environs, USA (7010 points)
Amsterdam-Rotterdam-Utrecht-The Hague (the Randstad), Netherlands (6808 points)
Manchester-Liverpool, UK (5590 points)
Paris and environs, France (5206 points)
Barcelona and rural Catalonia, Spain (5186 points)
Tokyo and environs, Japan (5037 points)
Boston and environs, USA (4849 points)

Here we get a few urban agglomerations that didn't make the previous lists individually, but do when combined, most prominently two European agglomerations that geographers have already given names to: the Rhine-Ruhr region of Germany, and the Randstad of the Netherlands.

In addition, two odd entries cover large swathes of rural Catalonia. While these each have an anchor point (Andorra and Barcelona, respectively), a large number of the points in the area are actually in rural Catalonia, and the articles only exist on the Catalan-language Wikipedia, apparently a result of an effort to import articles on nearly every village and geographical feature in the region.

Some details on the method

The simplest way of determining the areas of greatest coverage would have been to take the latitude/longitude coordinates and bin them with some bin size. For example, with a bin size of 1 degree in each direction, the coordinate 40.6° N, 22.9° E would go into a bin bounded by 40-41° N and 22-23° E. Then we'd just sort the bin counts.
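As a rough illustration of the binning idea, here is a minimal Python sketch. The function name bin_counts and the sample coordinates are made up for this example; it is not the code used to produce the lists above.

```python
import math
from collections import Counter

def bin_counts(coords, bin_deg=1.0):
    """Count (lat, lon) pairs per square bin of bin_deg degrees on each side.

    Bins are keyed by the integer index of their south-west corner, so with
    bin_deg=1.0 the key (40, 22) is the bin bounded by 40-41 N and 22-23 E.
    """
    counts = Counter()
    for lat, lon in coords:
        # floor() rather than int() so negative coordinates bin correctly,
        # e.g. longitude -0.1 belongs to the bin starting at -1, not 0.
        key = (math.floor(lat / bin_deg), math.floor(lon / bin_deg))
        counts[key] += 1
    return counts

# Hypothetical sample data, for illustration only.
sample = [(40.6, 22.9), (40.1, 22.2), (51.5, -0.1)]
densest_bins = bin_counts(sample).most_common(10)  # sorted by count, densest first
```

Sorting the bin counts is then just a call to most_common on the counter.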

This has two problems though. The bins don't have the same area in different parts of the earth, since lines of longitude converge towards the poles. The difference is enough to be fairly significant when comparing, say, Cairo versus Copenhagen (this effect is what makes the familiar Mercator map projection distort land sizes towards the poles). Also, the boundaries between bins are arbitrary, so clusters of points that happen to straddle a boundary will get diluted between two (or even possibly four) different bins.
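To put a number on the first problem, the sketch below computes the area of a 1-degree bin on a spherical earth, using the standard formula R² · Δλ · (sin φ₂ − sin φ₁) for a cell between two parallels and two meridians. The function name and the 6371 km mean radius are assumptions of this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a perfectly spherical earth

def bin_area_km2(lat_south, bin_deg=1.0):
    """Area in km^2 of a bin_deg x bin_deg lat/lon cell whose southern edge
    lies at latitude lat_south (in degrees)."""
    lat1 = math.radians(lat_south)
    lat2 = math.radians(lat_south + bin_deg)
    dlon = math.radians(bin_deg)
    return EARTH_RADIUS_KM ** 2 * dlon * (math.sin(lat2) - math.sin(lat1))

cairo_bin = bin_area_km2(30)        # Cairo is at roughly 30 N
copenhagen_bin = bin_area_km2(55)   # Copenhagen is at roughly 55.7 N
ratio = cairo_bin / copenhagen_bin
```

Under these assumptions the Cairo bin covers about 1.5 times the area of the Copenhagen bin, so raw bin counts would favor Cairo by that factor before any articles are even counted.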

There are possible hacks to mitigate the problems. We could address the first by making the bin width in degrees vary with latitude. To fix the second, we could scan a window across the globe, moving it in increments smaller than the actual size of the window. The main advantage of the binning method, though, was simplicity and ease of computation. Once we're scanning a non-fixed-size window across the world, it's about as easy to do something better directly.
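For the curious, the second hack (overlapping windows) could look something like the following sketch. It deliberately ignores the first problem, so windows are still measured in degrees; the function name and default sizes are made up for this example.

```python
import math
from collections import Counter

def sliding_window_counts(coords, window_deg=1.0, step_deg=0.25):
    """Count points in overlapping square windows of window_deg degrees whose
    south-west corners lie on a grid with spacing step_deg.

    Assumes window_deg is a whole multiple of step_deg. Keys are the integer
    grid indices (i, j) of each window's south-west corner.
    """
    k = round(window_deg / step_deg)  # grid steps spanned by one window
    counts = Counter()
    for lat, lon in coords:
        i = math.floor(lat / step_deg)
        j = math.floor(lon / step_deg)
        # The point lies in every window whose corner index is in
        # i-k+1 .. i (and likewise for j), i.e. in k*k windows in total.
        for di in range(k):
            for dj in range(k):
                counts[(i - di, j - dj)] += 1
    return counts

# Two hypothetical points straddling the 41 N line: fixed 1-degree bins
# would split them, but some overlapping window contains both.
straddling = [(40.9, 22.5), (41.1, 22.5)]
best = max(sliding_window_counts(straddling).values())
```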

A more direct definition of the most-article-populated area: what is the point on the globe that has the most coordinates within a distance of n kilometers from the point? Since the earth is roughly a sphere (close enough for our purposes), by distance we mean great-circle distance.
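A common way to compute great-circle distance on a sphere is the haversine formula. The sketch below uses it to count the coordinates within a given radius of a candidate point; the function names and the 6371 km radius are assumptions of this example rather than details of the original scripts. Note that the lists above are labeled by diameter, so a 1-km diameter corresponds to radius_km=0.5.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a spherical earth

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees,
    computed with the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def count_within(center, coords, radius_km):
    """Number of (lat, lon) pairs in coords within radius_km of center."""
    clat, clon = center
    return sum(1 for lat, lon in coords
               if great_circle_km(clat, clon, lat, lon) <= radius_km)
```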

Technical aside: We can even retain most of the computational efficiency of the latitude-longitude binning approach with an optimization: For a given candidate point, we can compute a lat-lon bounding box that lets us quickly exclude most points, and which also lets us use spatial indexes based on rectangular regions.
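One way to compute such a bounding box is sketched below. The latitude half-height is just the radius expressed as an angle; the longitude half-width has to be widened away from the equator. The helper names are hypothetical, and the handling of the poles and the antimeridian is one possible choice among several.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a spherical earth

def bounding_box(lat, lon, radius_km):
    """Return (lat_min, lat_max, lon_min, lon_max) in degrees enclosing the
    circle of radius_km around (lat, lon). The longitude bounds may fall
    outside [-180, 180] when the box straddles the antimeridian."""
    delta = radius_km / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(delta)
    lat_min, lat_max = lat - dlat, lat + dlat
    if lat_min <= -90.0 or lat_max >= 90.0:
        # The circle contains a pole, so every longitude is possible.
        return max(lat_min, -90.0), min(lat_max, 90.0), lon - 180.0, lon + 180.0
    # Exact half-width in longitude of a small circle on a sphere.
    dlon = math.degrees(math.asin(math.sin(delta) / math.cos(math.radians(lat))))
    return lat_min, lat_max, lon - dlon, lon + dlon

def in_box(box, lat, lon):
    """Cheap rectangular pre-filter; points that pass still need the exact
    great-circle distance check."""
    lat_min, lat_max, lon_min, lon_max = box
    if not (lat_min <= lat <= lat_max):
        return False
    # Modular comparison so boxes crossing the antimeridian still work.
    return (lon - lon_min) % 360.0 <= (lon_max - lon_min)
```

A spatial index keyed on plain latitude/longitude ranges would additionally need a box that crosses the antimeridian to be split into two rectangles.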

That leaves how to choose the candidate points. Conceptually, we want to check every point on the earth's surface, all infinity of them. An approximation is to scan a point across the globe with some step size. This spends a lot of time determining that empty circles are empty, though, especially if we use a small step size. The heuristic improvement we'll use is to treat each coordinate in our data set as a candidate circle center; that gives us an adaptive step size that tries more finely spaced candidates in the denser parts of the data set, and jumps more coarsely through the less dense parts (trying nothing at all in the parts that have no data points). Finally, we remove circles that overlap with a circle that had a higher count; otherwise the top 10 circles would all be slight offsets of each other.
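Put together, the candidate selection and overlap removal might look like the brute-force sketch below. It is one reading of the procedure, not the original code: it compares every pair of points (so it only suits small data sets), treats two circles as overlapping when their centers are less than two radii apart, and drops a circle only if it overlaps one that has already been kept. All names are made up for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a spherical earth

def great_circle_km(p, q):
    """Haversine distance in km between two (lat, lon) pairs in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dlam = math.radians(q[1] - p[1])
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def densest_circles(coords, radius_km, top_n=10):
    """Return up to top_n (count, center) pairs, densest first, using every
    data point as a candidate center and skipping overlapping circles."""
    scored = []
    for c in coords:
        n = sum(1 for p in coords if great_circle_km(c, p) <= radius_km)
        scored.append((n, c))
    # Stable sort: among equal counts, earlier data points win.
    scored.sort(key=lambda t: t[0], reverse=True)

    kept = []
    for n, c in scored:
        # Keep a circle only if it is at least two radii from every kept one.
        if all(great_circle_km(c, k) >= 2 * radius_km for _, k in kept):
            kept.append((n, c))
            if len(kept) == top_n:
                break
    return kept

# Hypothetical data: a cluster of three points and a cluster of two.
points = [(0.0, 0.0), (0.0, 0.001), (0.001, 0.0), (10.0, 10.0), (10.0, 10.001)]
result = densest_circles(points, radius_km=1.0)
```

In practice the inner count would be restricted to points inside the candidate's bounding box, which is what keeps the real computation tractable.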

Acknowledgments

Data from Wikipedia-World as of the March 2011 database dump. Further preprocessed to only include one coordinate per article; otherwise the results are dominated by list articles that include hundreds of coordinates each. Maps from OpenStreetMap, overlays using Polymaps, and heavy lifting by Perl.