Friday 25 January 2013

robots.txt retroactively removes content from Wayback Machine

There is a fun experiment anyone can try: create some unique piece of information and store it on two kinds of media: a sheet of paper you print it on, and a file on a USB pen drive. Then put both objects on a hard, solid surface, like a concrete floor. Take a hammer and hit the paper hard once. Then hit the body of the USB pen drive with the same force. Next, try to recover the information from both media. The paper may have a hole in it, and perhaps a few words are impossible to read. The pen drive, however, is likely to be a total loss. If the silicon chip is cracked, your only chance is to bring it to a specialized laboratory that will charge you an unimaginable fee for a tiny chance of recovering perhaps a few words of the text.
What I am saying with this whole story is that I laugh at every advert that claims “save your old photos by scanning them with our digital photo scanner!” The easier it is to create and replicate information, the easier it generally is to lose it as well. I am certain that at some point in history there will be something like a super-sized version of that hammer, hitting our fragile digital archives. If it is bad enough, humanity will be catapulted back to medieval times and history will be a black hole starting from around the year 2000. Maybe they will believe the world really did go to hell at the end of 2012. If you have something you really want to preserve, make a hard copy of it. No, make as many copies of it as possible, on all kinds of media.
Now, this whole introduction serves to illustrate how grave a certain issue with the Internet Archive's “Wayback Machine” is. The Wayback Machine is a great initiative: its goal is to create digital archives of old websites. I once believed that once a website was archived, it would stay accessible until either the whole Wayback Machine was destroyed or someone explicitly asked for the information to be deleted. Now, however, I have discovered that information can also disappear in a much more trivial and dumb way.
If someone places a ‘robots.txt’ file on a domain that prohibits crawlers from retrieving that domain, the Internet Archive will retroactively apply this prohibition. There is some logic behind this: if someone notices that a confidential website has leaked and has been archived over the past months, this system allows them to remove the archived copies without much fuss. The mechanism, however, is dumb as a brick: if a domain expires and is subsequently bought by someone who has no rights whatsoever to the original content, the new owner can still use robots.txt to retroactively hide everything from the archive.
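To illustrate, blocking all crawlers takes nothing more than a two-line robots.txt at the root of the domain; the wildcard user agent matches every bot, the Wayback Machine's crawler included:

  User-agent: *
  Disallow: /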
It turns out that there are many domain name squatters who buy old domains and place a prohibitive robots.txt on the empty “for sale” page because they do not want it to litter search engines, which is actually a good thing. What is bad, however, is that this instantly hides the entire archive of the old website in the Wayback Machine. There is no justification for this aside from laziness on the part of the programmers and excessive prudence. The squatter has no rights whatsoever to influence the information that was stored on the old website; they have only bought a domain name. Therefore I would greatly appreciate it if the people responsible for the Wayback Machine would implement a better way to balance legal concerns with the completeness of their valuable archive.

Friday 4 January 2013

Approximating iTunes DJ in iTunes 11

iTunes 11 has generated both praise and revolt in the online community because it is a mix of improvements and regressions. One of the most annoying things for me and many others is the omission of the iTunes DJ feature. Apple's idea was to replace it with the “Up Next” feature, but it is not the same. What I liked a lot about iTunes DJ is that it was a regular playlist that I could easily tweak, and I could see all the desired information about upcoming songs at a glance. The Up Next feature requires a click on a button to temporarily view a severely limited list of what is coming next, and manipulating it is a hassle.
It is possible, however, to create a smart playlist that more or less behaves like the good old iTunes DJ.
The screenshot is pretty straightforward: it shows the most basic setup, but you can tailor it to your needs with additional rules. The only essential things are the fixed limit “selected by random”, “Live updating”, and the “Last played is not in the last …” rule. Combined, these will automatically remove a played song and cause a new one to be added to the queue, just like in iTunes DJ. The time span for the “last played” rule is not essential; anything starting from 5 seconds should work. You can use a larger value if you want to avoid hearing the same song twice within a certain time span.
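For reference, since the screenshot only shows one possible configuration, the rules could look roughly like the sketch below. The “Media Kind” rule and the limit of 25 songs are just examples; use whatever matching rules, queue length, and time span you prefer.

  Match: Media Kind is Music              (example rule, add any others you like)
  Limit to: 25 items, selected by: random
  Last Played: is not in the last …       (any short time span)
  Live updating: enabled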

Next, make sure the playlist is sorted on the very first column (the one with the numbers and no header title), and disable shuffle. Then you can arrange the songs to your liking, and start playing. You will notice that whenever you remove songs, new ones will be added at the bottom, just like in iTunes DJ. The only aspects in which this differs from iTunes DJ are:
  • The biggest drawback is that it is impossible to add specific songs to the playlist. If you want to play one or more specific songs that were not randomly picked, your best option is to use “Up Next” after all.
  • There is no practical way I know of to increase the chance that higher-rated songs are played more often. Not a big deal for me, since I never used this feature anyway.
  • You cannot see the songs that have already been played; you will need a separate “recently played” smart playlist for that.
  • Of course the feature to let people vote for upcoming songs is still missing.