Digital libraries save Canadian content, Tweets, and everything else

By: Lee Rickwood

April 23, 2010

Some folks may not like this very much – I hope you’re not one of them.

It seems everything on the Internet is being saved, and it’s all searchable, too.

There have been a number of recent announcements about archiving content from the Internet, and some new gargantuan efforts by public organizations and private companies alike to make sure nothing online gets lost.

The IT History Society – you knew about them, right? – is launching a searchable archival database of information technology resources, including historical and archival collections.

Then, there’s the ongoing work at the University of Toronto Library, where nearly a quarter of a million books have already been digitized, with plans to preserve 250,000 more in the archive.

And now comes word (and you may find this somewhat more timely and immediate depending on your Twitter account status) that the United States Library of Congress will archive all public Tweets!

Google, for its part, says it will make that archive searchable.

That’s right. Every public tweet, ever since Twitter’s inception in March 2006, will be archived – what a treat! Apparently, more than 50 million tweets are sent each day, so the archive will number in the billions.
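For the curious, here’s a rough back-of-the-envelope sketch of why “billions” is the right ballpark. The daily volume and the long-run average below are assumptions for illustration, not official Twitter or Library of Congress figures.

```python
# Rough, hypothetical estimate of the tweet archive size -- not official figures.
# Assumes tweet volume ramped up from near zero in March 2006 toward the
# ~50 million per day cited in April 2010, so a lower long-run average is used.
days_since_launch = 4 * 365           # roughly March 2006 to spring 2010
assumed_average_per_day = 10_000_000  # assumed long-run average, well below the 2010 peak

estimated_total = days_since_launch * assumed_average_per_day
print(f"Estimated archive size: ~{estimated_total / 1e9:.0f} billion tweets")
# -> Estimated archive size: ~15 billion tweets
```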

The Library of Congress’s blogger Matt Raymond made the initial announcement, promising more details and an official release soon. In the meantime, he wrote, “[I]t boggles my mind to think what we might be able to learn about ourselves and the world around us from this wealth of data.”

Yeah, well, me too! I am boggled at the thought.

Perhaps more mind-boggling is the Library’s complete digital collection, which now adds up to some 167 terabytes of data. That collection, along with the National Digital Information Infrastructure and Preservation Program, is dedicated to collecting, preserving and making available information that is created in digital form only.

That mandate is certainly understandable, and it echoes the goal of other Internet collections, such as the IT History Society’s archive, which is maintained so that “this unique information will add to the future research of the information industry” – an industry “that has had the most impact on mankind in the shortest time frame.”

The database today consists of 233 international information technology historical and archival collections encompassing over 2.4 million documents (small compared to the Library of Congress, of course, at just 195 gigabytes of stored data). The database is expected to nearly double over the next five years.

Through a partnership with Archive-It, the Internet Archive’s web archiving service, all 233 collections will be crawled and text-indexed every 30 days for full keyword searchability.

Meanwhile, did you know that anyone with an Internet connection can access the U of T’s digitized material, as well as material from (among others) the libraries of Penn State, the University of California, The British Library and the Boston Library Consortium, at the Internet Archive?

The folks on the U of T library team are now sending roughly one terabyte of data, in the form of scanned books, to the Internet Archive every day.

There are 18 high-end scanning stations there, running some 16 hours per day to get the job done!

Because a data transfer of that size would simply not be practical on the commercial Internet (it could potentially bog it down), the U of T is working with the broadband backbone operated by CANARIE, Canada’s Advanced Research and Innovation Network, which manages an ultra-high-speed private network described as hundreds of times faster than the Internet.
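To put that one-terabyte-a-day figure in perspective, here’s a quick sketch of how long a single terabyte takes to move at different line rates. Both speeds below are assumptions chosen for illustration – the article doesn’t say what CANARIE’s links actually run at.

```python
# Hypothetical transfer-time comparison for one terabyte of scanned books.
# The link speeds are illustrative assumptions, not CANARIE's published specs.
TERABYTE_BITS = 1e12 * 8  # one terabyte expressed in bits

def transfer_time_hours(speed_bits_per_second):
    """Return the hours needed to move one terabyte at the given line rate."""
    return TERABYTE_BITS / speed_bits_per_second / 3600

print(f"Typical commercial upload (~10 Mbps): {transfer_time_hours(10e6):.0f} hours")
print(f"Research backbone link (~10 Gbps):    {transfer_time_hours(10e9):.2f} hours")
# -> roughly 222 hours (more than nine days) versus about 13 minutes
```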

Now, if it could just catch up to that message I didn’t really mean to send…

submitted by Lee Rickwood

# # #

What about you? Have you sent something online you’d rather not have archived? Tell us about it ha ha 😉

While you’re thinking it over, check out more of WhatsYourTech.ca’s coverage of privacy.

