Visualizing Wikipedia Editors

Introduction

I was recently contracted by my friend, Morgan Currie, in the new media masters program at the University of Amsterdam, to construct visualizations of Wikipedia editing activity across the articles pertaining to Feminism, a topic subject to intense debate and occasional vandalism. Her aim was to analyze Wikipedia’s validity as a source in scholarly research. Her results may be found in this PDF.

All scripts used to generate this project are available as a gist from GitHub.

Methods of Data Acquisition

Online Tools

Websites with access to MediaWiki’s APIs were used to retrieve CSV files for import into OpenOffice.org, where data analysis was performed.

MediaWiki API

Although the MediaWiki API was not accessed directly during data collection for this project, the following example in Ruby shows how to retrieve the full edit history for a specific author, given the timestamp of a recent edit as listed on their user page, into a CSV file for easy import into a spreadsheet.
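As a rough companion to that Ruby example, the same request can be sketched in Python. This is only a sketch under stated assumptions: it targets the English Wikipedia endpoint, uses the MediaWiki `list=usercontribs` query, and the function names `usercontribs_url` and `contribs_to_csv` are hypothetical helpers invented for illustration. It builds the request URL and converts an already-decoded JSON response into CSV text, leaving the actual HTTP call to whatever client you prefer.

```python
import csv
import io
from urllib.parse import urlencode

# Assumption: English Wikipedia; swap in another wiki's api.php as needed.
API_URL = "https://en.wikipedia.org/w/api.php"


def usercontribs_url(user, start_timestamp, limit=500):
    """Build a usercontribs query URL, listing edits from start_timestamp backwards."""
    params = {
        "action": "query",
        "list": "usercontribs",
        "ucuser": user,
        "ucstart": start_timestamp,  # e.g. the timestamp of a recent edit
        "uclimit": limit,
        "ucprop": "title|timestamp|comment",
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def contribs_to_csv(response):
    """Turn a decoded JSON response into CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["timestamp", "title", "comment"])
    for edit in response["query"]["usercontribs"]:
        writer.writerow([edit["timestamp"], edit["title"], edit.get("comment", "")])
    return out.getvalue()
```

A single request returns at most `uclimit` rows, so a full edit history means repeating the request with the continuation value the API returns until none is left.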

Nodebox Visualization Script

Here is a script for Nodebox that generates a pretty visualization from a pickled Python array, itself created with some mystical search-and-replace fu on a CSV file.

Initial Project Concept by Morgan Currie

Pages Analyzed

Top 15 (plus feminism = 16 total)

  1. feminism (already scraped)
  2. men’s rights
  3. liberal feminism
  4. anarcha-feminism
  5. antifeminism
  6. women’s rights
  7. ecofeminism
  8. radical feminism
  9. men and feminism
  10. sex-positive feminism
  11. history of feminism
  12. cultural feminism
  13. marxist feminism
  14. feminist theology
  15. women’s suffrage

Top 5 (for II)

  1. Feminism (already scraped)
  2. radical feminism
  3. antifeminism
  4. women’s rights
  5. women’s suffrage

Viz Method

I. How much editing activity on the discussion page occurs in top 15 articles?

Est. time: 1-1.5 hours

example

  1. Go to the first article and click on the history page (you’ll see the history tab on top of the article’s page). Copy the URL of the history page.
  2. Copy and paste the data chart into an Excel sheet and look at how many lines there are (to show the number of edits). Do this for each article, and make a separate spreadsheet of editing history for each article after you’ve scraped it.
  3. Make a separate spreadsheet showing the article title in one column and the number of edits for each article in another.
  4. Register for Manyeyes (it doesn’t take any time). Start a data set with the spreadsheet showing how many edits per article. Use the [bubble chart to visualize](http://manyeyes.alphaworks.ibm.com/manyeyes/page/Bubble_Chart.html)
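The line counting in steps 2 and 3 can also be scripted. The sketch below assumes each article's history has been saved as CSV text with one header row and one row per edit; the function names `count_edits` and `edits_per_article` are hypothetical and not part of the original scripts.

```python
import csv
import io


def count_edits(history_csv, has_header=True):
    """Count edit rows in one article's history, skipping blank lines."""
    rows = [row for row in csv.reader(io.StringIO(history_csv)) if row]
    if has_header and rows:
        return len(rows) - 1
    return len(rows)


def edits_per_article(histories):
    """Map each article title to its number of edits.

    `histories` maps article title -> CSV text of that article's history.
    """
    return {title: count_edits(text) for title, text in histories.items()}
```

The resulting dictionary has the same shape as the two-column summary spreadsheet (article title, number of edits) that the bubble chart takes as input.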

II. How much editing history by month occurs in top five articles?

More time consuming, probably will take 2 hours to get data and more to visualize.

example

  1. Look at your spreadsheets of editing history for the top five articles (I’ll indicate them). Count how many edits happen each month for each article (I just cut and paste all the rows from a month into a fresh spreadsheet and see how many lines there are).
  2. Put this information in a separate spreadsheet. You should have five charts that show how many edits happened across 12 months for each article.
  3. Optional: visualize this somehow like this http://wiki.digitalmethods.net/pub/Dmi/ThePlaceOfIssues/Evolution.png (You can make bubbles with this tool, though I’m not sure how to get the different shades of red: http://tools.issuecrawler.net/beta/bubbleline/) – or maybe you can get photoshop to make circles in different shades?
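The monthly tally in step 1 can be sketched in a few lines instead of cutting and pasting rows. This assumes the edit timestamps are available as ISO-style strings beginning with `YYYY-MM` (for example `2010-03-14T09:26:53Z`); timestamps copied from a history page in another format would need converting first. The function name `edits_per_month` is hypothetical.

```python
from collections import Counter


def edits_per_month(timestamps):
    """Count edits per calendar month, keyed by the 'YYYY-MM' prefix."""
    return Counter(ts[:7] for ts in timestamps)
```

Running this once per article gives the five month-by-month tables described in step 2.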

IV. Map the Users. Shouldn’t take too long, maybe 1.5 hours to get data.

example

  1. Go back to the history page URL for all 15 articles. Click on ‘revision history statistics.’ This will give you the usernames of the top editors of each article.
  2. Get the names of the top 100 editors for each article: put their name in one column, the article they edit in another, and their number of edits in a third. Find the top 20 most active across all 15 articles by ordering the spreadsheet alphabetically by the username column, so that each editor’s rows sit together and can be totaled.
  3. Optional: visualize this somehow (same way as II – a chart of bubbles.)
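Totaling each editor's edits across articles, as in step 2, can be sketched as follows. The input is assumed to be rows of (username, article, number of edits), mirroring the three spreadsheet columns; the function name `top_editors` is hypothetical. Instead of sorting alphabetically and summing by eye, it sums per username and ranks the totals.

```python
from collections import defaultdict


def top_editors(rows, n=20):
    """Return the n editors with the most edits summed over all articles.

    `rows` is an iterable of (username, article, edits) tuples.
    Ties are broken alphabetically by username.
    """
    totals = defaultdict(int)
    for username, _article, edits in rows:
        totals[username] += edits
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]
```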

V. Percentage of Bots across all articles. 1.5 hours

example

  1. Go back to the history page for all 15 articles. Search for ‘bot’ in the user column.
  2. Start a chart showing how many bots per article.
  3. Calculate the percentage of bot activity compared to total editing activity.
  4. Use the [bubble chart to visualize](http://manyeyes.alphaworks.ibm.com/manyeyes/page/Bubble_Chart.html).
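The percentage in step 3 can be sketched like this. It mirrors the manual search by treating any username containing "bot" (case-insensitive) as a bot. That is an assumption, not a reliable test: a human editor with "bot" in their name would be miscounted. The function name `bot_percentage` is hypothetical.

```python
def bot_percentage(usernames):
    """Percentage of edits whose username contains 'bot' (case-insensitive).

    `usernames` holds one entry per edit, so repeat editors appear repeatedly.
    """
    if not usernames:
        return 0.0
    bot_edits = sum(1 for name in usernames if "bot" in name.lower())
    return 100.0 * bot_edits / len(usernames)
```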