Richard Jones talks details, remarks on the site's recent redesign, and hints at what 2009 will bring

(This blog post is a full transcript of the half-hour interview whose summary was posted last week. While it is certainly long, Richard goes into great detail about some of the decisions and goings-on at one of the world's largest music portals. The following article is the result of a collaboration between DailyTech, Last.fm, and Sun Microsystems.)

Last.fm has a pretty large database of information that listeners have put in there through your scrobblers. What’s it like acting as proprietors of such a large database of listeners’ habits?

Well, it’s great obviously, it’s what our service is built around and it’s a major asset. It’s great to have all that data that’s fairly unique as well – I can’t think of anyone who has that kind of database that uses it for the same things we do. It gives us a unique opportunity to do some quite funky things with the data.

That’s one of the fun things about working at Last.fm as well: there’s so much knowledge and so many things that you can extract from that database. Obviously, we’re doing our best in doing a bunch of stuff with it; we’re always looking at it in different ways and always sort of thinking, “what would happen if we tried this, or what would happen if we tried that?” and we can actually go back to the raw data and run some numbers and come up with some other ideas. So yeah, it’s great.

What were the inspirations behind Last.fm and Audioscrobbler?

There were originally two different projects, really. I was working on Audioscrobbler in 2002. Felix Miller and Martin Stiksel were working on Last.fm completely in isolation but [we were] only a few miles apart from each other. The inspiration behind Last.fm was that Felix and Martin were originally running an online record label where you could upload MP3s, but they had so much content that they didn’t know what to play people.

[They] used to have a radio station that was just random, and so they wanted to help people find the right music to listen to, and Last.fm really grew out of that.

At the same time, I was working on Audioscrobbler and my motivations were basically to be able to discover new music without having to do all the legwork of reading all the music magazines and keeping up to date with current affairs and so on, so I wanted to find a technical measure to discover new music – but I was also partly interested in the sort of personal statistics side of things.

People would ask you, “What’s your favorite band?” It’s a hard question to answer, first of all. Technically when people answer, what they say their favorite band is isn’t always what they listen to the most. It’s what they perceive to be their favorite based on what’s trendy or what some of those other influencing factors are. It was quite interesting for me to see the difference between sort of perceived tastes and what you thought your favorite music was, compared to what you’re actually listening to the most. For most people, there is a discrepancy there that was interesting to find out.

What kind of hardware powers the main site? What about the Audioscrobbler database?

I checked how many servers we’ve got, and we have about 350 to 400 powering the whole service. Obviously, we do a lot of different things: we have the radio side of things, the number crunching, and the web service. The hardware that we use is fairly standard stuff: it’s all Intel and AMD machines, all rack-mounted hardware. We have some blades as well. There’s not really any exotic hardware.

I’m told your site is a big customer of Sun Microsystems?

We actually have a mix right now. A few years ago we were buying from a local supplier here, and over the years it became more important to us to get really power-efficient equipment, because at the data centers in London and the UK power is at a real premium; it was hard to get enough power. So we started looking around for machines that were more tailored to low-powered stuff.

So we have a mix of different suppliers but right now we’re buying quite a lot from Sun. We just got some new low-power blades that we’ve put in to do web serving, and our main database – with which we use PostgreSQL – is also on Sun hardware, for example. So yeah, we’ve been getting some good stuff from them. Sun seems to make a good range of servers that are quite conscious on the power requirements, and are quite good about giving you the spec about how much power they’re going to draw.

Out of curiosity, how much space does it take to store such a big database? I’d imagine that probably stretches into the hundreds of terabytes.

We have the database itself, there’s the raw data, and then there’s all the mp3s as well, and then there’s all this additional data that we’ve computed over the top in kind of different layers. Yeah, it’s in the hundreds of terabytes, though I can’t give you an exact number.

We actually do a lot of our storage and processing in Hadoop, which is a framework based on a paper that Google released on the same subject. So, that’s actually a distributed computing framework written in Java.
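To illustrate the programming model Richard is referring to, here is a minimal single-process sketch of the map, shuffle and reduce steps, applied to counting plays per artist. It is not Last.fm's code: real Hadoop jobs are typically written in Java and run across many machines, and the record layout, sample data and function names below are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical scrobble records: (user, artist) pairs.
SCROBBLES = [
    ("alice", "Artist A"),
    ("bob", "Artist B"),
    ("alice", "Artist B"),
    ("carol", "Artist A"),
    ("bob", "Artist B"),
]


def map_phase(record):
    """Emit one (artist, 1) pair for each scrobble record."""
    _user, artist = record
    yield artist, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the emitted counts for a single artist."""
    return key, sum(values)


def count_plays(records):
    """Run map, shuffle and reduce in sequence and return artist -> plays."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(count_plays(SCROBBLES))
```

In a real cluster the map and reduce functions would run in parallel on different slices of the data; only the overall shape of the computation is shown here.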

How big of a challenge is it to normalize, or clean up, the data that Audioscrobbler receives from clients phoning home? I’ve noticed some pretty amazing corrections to metadata in my music collection over the years, just by paying attention to my “recently listened tracks”. A Japanese artist will, for example, show up printed in Japanese characters as opposed to whatever I had entered [in my MP3 file’s tags].

I’d say that’s one of our biggest challenges, trying to stay on top of massive cleanliness problems. For everything we fix, another 10,000 people scrobble the song with the wrong spelling, so it’s a never-ending battle, really.

But earlier this year – actually right at the start of this year – we released a fingerprinting system that really helps us. So in the scrobbler software now, as well as scrobbling the names you claim, it actually reports an audio fingerprint. That’s actually helped us behind the scenes to match up songs that are the same but have different spellings. We’ve made a lot of progress this year, and although not a lot of it is visible yet, we think that next year we’re going to roll out a lot of these changes and actually fix even more problems. It is a huge challenge; the common numbers are something like 300 million different tracks that we’ve recorded (that’s in tons of different spellings), and about 20 million different artists – but obviously not all of those are valid. So that’s the challenge: we still haven’t quite answered the question of how many unique artists there really are – there are obviously far fewer than what we actually have because of all the misspellings. It’s an ongoing problem and it will never be solved, because there’s always new music being released as well and so you have to constantly keep updating the system. But we’ve made a lot of progress, and we’re working on that for next year as well, so we’ll continue to address it.
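One way to picture how a fingerprint can reconcile differently spelled submissions is sketched below: scrobbles that share a fingerprint are grouped together, and the most frequently claimed spelling is taken as the canonical one. This is only an assumed approach for illustration. The interview does not say how Last.fm actually picks the correct name, and the fingerprint IDs, sample data and function name are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical submissions: (fingerprint_id, claimed_artist, claimed_track).
SUBMISSIONS = [
    ("fp-001", "The Beatles", "Hey Jude"),
    ("fp-001", "Beatles, The", "Hey Jude"),
    ("fp-001", "The Beatles", "Hey Jude"),
    ("fp-002", "Nine Inch Nails", "1 Ghosts I"),
    ("fp-002", "NIN", "1 Ghosts I"),
    ("fp-002", "Nine Inch Nails", "1 Ghosts I"),
]


def canonical_names(submissions):
    """Map each fingerprint to its most commonly claimed (artist, track)."""
    by_fingerprint = defaultdict(Counter)
    for fingerprint, artist, track in submissions:
        by_fingerprint[fingerprint][(artist, track)] += 1
    # most_common(1) returns a list holding one ((artist, track), count) pair.
    return {
        fingerprint: counts.most_common(1)[0][0]
        for fingerprint, counts in by_fingerprint.items()
    }


if __name__ == "__main__":
    for fingerprint, (artist, track) in canonical_names(SUBMISSIONS).items():
        print(fingerprint, artist, "-", track)
```

A majority vote like this would still fail when most listeners share the same wrong tag, which fits Richard's point that the problem is never fully solved.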

As a user since 2005, whose play count is close to 20,000, I have always had equal parts apprehension and fascination with the “Recently listened tracks” feature. I’ve heard all kinds of stories about how that feature has been used or misused: bosses checking up on employees, ex-boy/girlfriends stalking former partners, and people checking to see if someone’s at their computer by checking if they played anything recently. I’ve noticed that you guys have played around with the timeliness of that data and when it’s available to the general public – but what’s Last.fm’s official position on this feature? Has it been a controversial inclusion?

That feature has been there since the very first version, and it’s always been one of the most popular features that people actually talk about – because people actually use it and put it on their blog and keep it updated. So I think that, for the most part, people really love it.

We did introduce, earlier this year, an option to hide all the real-time data: if you don’t want anyone to know if you are online right now, you have the option to disable all your real-time data which includes recently-listened tracks.

Some people are a bit concerned about it, but part of our service is to broadcast your music tastes to the world. So it’s part of what we do, it’s quite a big part really: actually saying to the world, “this is what I am listening to right now,” and Last.fm wouldn’t be the same without it. But like I said, we do have the option to hide that data if you want to keep that a secret.

Personally, I like it – it’s a great feature to have.

We’ve had some interesting stories over the years, actually, where people have used it [to help track down a stolen laptop.] We get emails once or twice a month saying, “my laptop was stolen, and I can see the person who stole it is playing music on my iTunes right now,” and then we have actually helped the police track down people’s laptops … from the scrobbling feed on their account.

Was that in the U.S. or in the UK?

Yes, it’s happened in the U.S. actually – it’s happening around the world but people in the U.S. have contacted us a few times.

We don’t make a point of logging the IP address [in our data collection], but when [thefts have] happened we put a watch on the account, allowing us to collect the IP address the next time it’s used.

Do you have any thoughts on the weaknesses of Audioscrobbler/Last.fm’s methods for figuring out various artist statistics? For example, Nine Inch Nails is now my “top artist” by a wide (230+) plays margin, simply because “Ghosts I-IV”, with its 36 tracks, turns out to be great background music for writing. Play that a few times and all of a sudden Nine Inch Nails now has twice the weight compared to other artists who put out a more conventional CD. Has Last.fm run into statistical anomalies with things like this?

That’s a good question. We’ve had many people suggest different ways over the years; one of the common things that come up in our forums is that people say, “You know I’d really like to track my tastes based on the number of minutes I’ve listened instead of the number of songs I’ve played.”
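The difference between the two measures that forum users ask about can be shown with a small sketch. The listening log, artist names and track lengths are invented for illustration, and the `rank` function is a hypothetical helper, not anything Last.fm exposes.

```python
from collections import defaultdict

# Hypothetical listening log: one (artist, track_length_seconds) entry per play.
PLAYS = (
    [("Artist A", 120)] * 36    # an album of 36 two-minute tracks
    + [("Artist B", 1200)] * 4  # four twenty-minute tracks
)


def rank(plays, by="plays"):
    """Rank artists by number of plays or by total minutes listened."""
    totals = defaultdict(float)
    for artist, seconds in plays:
        totals[artist] += 1 if by == "plays" else seconds / 60
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print("By plays:  ", rank(PLAYS, by="plays"))
    print("By minutes:", rank(PLAYS, by="minutes"))
```

With this made-up data, Artist A leads on play count (36 plays against 4) while Artist B leads on listening time (80 minutes against 72), which is the kind of skew the question describes.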

We’ve introduced a couple of different ways to deal with this kind of thing. One of the things we’ve done more recently is divide up your listening into different time periods; in the past, you used to have just one chart which [contained] your overall top artists. But now we have weekly, monthly, three-month, six-month, and twelve-month [charts], so we don’t necessarily look at what you listen to over all time.
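A chart limited to a time period can be sketched as a filter on the play timestamp followed by a count. The function below is a hypothetical illustration under that assumption, with invented parameter names; it is not a description of how Last.fm computes its charts.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_artists(scrobbles, now, days=None, limit=3):
    """Return the most-played artists within the last `days` days.

    `scrobbles` is an iterable of (artist, played_at) pairs, where
    `played_at` is a datetime. `days=None` means an all-time chart.
    """
    cutoff = None if days is None else now - timedelta(days=days)
    counts = Counter(
        artist
        for artist, played_at in scrobbles
        if cutoff is None or played_at >= cutoff
    )
    return counts.most_common(limit)


if __name__ == "__main__":
    now = datetime(2008, 12, 1)
    history = (
        [("Old Favorite", now - timedelta(days=200))] * 5
        + [("New Find", now - timedelta(days=2))] * 3
    )
    print("All time:", top_artists(history, now))
    print("Weekly:  ", top_artists(history, now, days=7))
```

Shorter windows let a recent discovery top the weekly chart even when an older favorite still dominates the all-time one.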

When it comes to recommendations and radio, it takes into account a whole bunch of different factors as well. We try to figure out when it’s appropriate to play something – we don’t just look at the number of plays. We look at a bunch of other things as well: tags, time of day, the context, and things like that. So we hope it doesn’t skew the system too much.

One of the other reasons we track the play count like that is because when Audioscrobbler and Last.fm were conceived, all the existing music recommendation services back then (which was early 2001, 2002) used to ask you to rate stuff with a 1-to-5 star system, or like, give it marks out of ten. That was actually a huge amount of effort to put in, and it didn’t seem to give very good results. You’d spend ages rating stuff and in the end it didn’t particularly reflect your tastes as well as it could have, so we think that just tracking the number of plays is the best balance to figure out your tastes.

In the end, we want to recommend new music based on what you actually listen to, not just what you say you like, because that tends to give better results.

Last.fm recently introduced a new site design that seems to have met with a bit of a mixed reaction among long-time users. A lot of people felt the old design worked pretty well. Why the redesign?

Since we started, we’ve added a lot of features to Last.fm. We are very feature driven. We reacted to what our users said they wanted – they would ask for an events feature so we added an events feature, for example – and we gradually added more and more things to the site. We felt that the site design and the layout had, over time, suffered because we’d added a lot to it without stopping to think and reorganize it. What we did this year was sort of take a step back, look at all the features and the things we’d added to the site, and then rethink how we’d lay them out and make them more accessible.

We did a lot of usability studies, and we did a lot of tests with some of our existing users. We have some usability labs in Las Vegas that we used for that as well. So what we did was we ended up with a new design that we thought people would find easier to get around and easier to understand. But obviously a lot of our users knew the old design really well; it’s always hard to adjust and it was a bit of a shock to the system initially for a lot of people. Looking back now, with the benefit of hindsight, I think we would have spent more time introducing it to people and getting a bit more feedback.

It would have been nice to have a much longer beta period, and the beta would have addressed a few of the other concerns that came up before we launched.

We learned quite a lot from that experience, but I think on the whole it was for the better.

What does Last.fm have planned for the future?

Ooh, well, some more of the same. We’re expanding onto a lot of different devices now; that’s been a bit of a theme recently. We’re on the iPhone, we’re looking very seriously at an Android app, we’re on the Sonos, we’re on the Logitech Squeezebox, and we’re on more devices than we can keep track of. We’re trying to make sure that wherever you listen to music, Last.fm is there, and you’ll be able to scrobble the songs that you listen to.

One of the things we hear from users is that once they start using Last.fm, and once they start scrobbling their music tastes, they feel like it’s a waste if they actually listen to music on a system where they can’t scrobble it. We’re trying to make sure that Last.fm is available everywhere.

Of course we’re going to be putting a lot more effort into the website as well, looking at what features we can improve or add, and at general improvements as well. Also, recommendations are still very important to us, and we will be working on our recommendation system … we think that’s going to be a big thing in 2009, because obviously there’s going to be more choice. There’s more music being made all the time, so we need to stay on top of the game there.

We think we’re in a really good position right now, we think we have the best music recommendation engine, but we also have to keep working hard to maintain that position.

This is more of a personal request, but I have to admit that “paint it black” is one of my personal favorite features. The preference for this setting is not saved to my profile, though – it seems like I have to click that every time I log in. Any chance of having that permanently saved?

[chuckle] It should be stored in a browser cookie, but I guess if you log out then it destroys the cookie. The “paint it black” thing is a popular feature; I guess I’ll pass that on to the web team and see what they have to say about it.

I can’t promise anything about it now.
