DigitalNZ indexes a lot of stuff. As I type this blog post, it's midday on Friday, November 4, 2011. If I visit the DigitalNZ website and search for every item we index, I get 25,661,157 results. That's a mind-boggling number.
In preparation for a talk I'm giving at the National Digital Forum in a few weeks, I am trying to understand what exactly are the 25 million pieces of New Zealand digital content we link to. I've been asking myself questions like "How are these things distributed over geographic space?" and "What aspects of life in New Zealand are represented by this material?" In order to answer these questions I have been using the DigitalNZ developer API to interrogate our metadata repository.
One of the first questions I had was "What is the temporal distribution of publication dates for the various images, articles, videos, manuscripts and so on that DigitalNZ indexes?" It turned out the DigitalNZ API makes this question easy to answer.
The histogram at the top of the post displays the volume of items we index organised by the year the content was originally published. I've restricted the figure to only show items that were created between 1840 and 2011. When reading these charts you should keep in mind that not every item of content that DigitalNZ indexes has a publication date in its metadata. If we don't know when the item was originally published then it is not represented in the charts below.
Here is a very high-level account of how I interpret the chart above. DigitalNZ has metadata describing a large volume of digital content that builds steadily from the 1850s, reaching a peak in the early 20th Century, before dropping away in the mid-1940s. We can see another rise in content beginning at the start of the 21st Century. I was a bit confused as to what the mountain of data represents, but when I showed this chart to a few DigitalNZ colleagues they each said, "Oh, that data mountain is obviously Papers Past".
Papers Past is a collection of digitised articles and images from New Zealand newspapers and periodicals published between 1839 and 1945. It is an amazing treasure and by far the largest set of items indexed by DigitalNZ. The second chart visually distinguishes between Papers Past and all the other items with a publication date that DigitalNZ collects metadata on.
I created these charts using the DigitalNZ developer API. To get the data for the first chart I constructed a query that looks like this:
Accessing the DigitalNZ API requires a key, which you append to all of your calls (i.e. the bit where it says [YOUR_API_KEY]). You can grab a key here. This API call asks DigitalNZ for a JSON file that provides a count of records summarised by year facets. Note that the API allows you to swap "json" for "xml" if you are so inclined.
The second call that I needed to make looked like this:
This query is very similar to the earlier search except it is restricted to just data from the Papers Past collection. Once I had this data under my belt, I simply transformed the JSON into a spreadsheet and produced a stacked histogram.
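For anyone who would rather script that last step than do it by hand, here is a minimal sketch of the idea in Python. It is not the process I actually used, and it makes some assumptions: that the year facets from each call have already been pulled out of the JSON into a plain dictionary mapping a year string to a count (the real response layout may well differ), and that the function names and the sample numbers are made up purely for illustration.

```python
import csv
import io


def stacked_rows(all_counts, papers_past_counts, first_year=1840, last_year=2011):
    """Build one row per year: (year, Papers Past count, all other items count).

    Both arguments are assumed to be dicts such as {"1900": 500}; this shape
    is hypothetical and is not taken from the DigitalNZ API documentation.
    """
    rows = []
    for year in range(first_year, last_year + 1):
        total = int(all_counts.get(str(year), 0))
        papers_past = int(papers_past_counts.get(str(year), 0))
        # Papers Past items are also counted in the overall total, so the
        # remainder is everything else with a publication date in that year.
        other = max(total - papers_past, 0)
        rows.append((year, papers_past, other))
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text that any spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["year", "papers_past", "other"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample counts, not real DigitalNZ figures.
all_counts = {"1900": 500, "1901": 620, "2010": 90}
papers_past_counts = {"1900": 450, "1901": 600}

rows = stacked_rows(all_counts, papers_past_counts, first_year=1900, last_year=1901)
print(to_csv(rows))
```

Charting the "papers_past" and "other" columns as a stacked series gives the same kind of picture as the second chart, with the year restriction handled by the first_year and last_year arguments.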
You can learn more about the DigitalNZ API by reading the developer documentation. Feel free to get in touch with me if you have questions about any aspect of it. I'm always keen to discuss ideas that people might have or to figure out ways to overcome hurdles you might face.
It would be really interesting to actually see, at a broad level, the sorts of info people are accessing, or searching for. Is there an equally high correlation of hits on Papers Past info through DNZ?
Thanks for your comment, Anne. That's a really interesting question and we are keen to get our head around the different patterns of use. I think I will have to look into it.