A Plausible Threat

Three years ago I wrote a post about the difficulty of convincing people of the value of privacy. I argued that targeted ads — creepy though they seem to privacy advocates — are not a good argument to convince others of the value of being parsimonious with personal data.

The recent discovery of the app Girls Around Me provides a concrete illustration of the dangers caused by promiscuous sharing.

Girls Around Me uses Foursquare, the location-based mobile service, to determine your location. It then scans for women in the area who have recently checked-in on the service. Once you identify a woman you’d like to talk to, one that inevitably has no idea you’re snooping on her, you can connect to her through Facebook, see her full name, profile photos and send her a message.

While the app is creepy in its own right, Charles Stross analyzes what allows this app to exist.

… the app is not the problem. The problem is the deployment by profit-oriented corporations of behavioural psychology techniques to induce people to over-share information which can then be aggregated and disclosed to third parties for targeted marketing purposes.

Furthermore, Stross describes dystopian visions of using the widely available data for purposes far more creepy than Girls Around Me:

Facebook encourages us to disclose a wide range of information about ourselves, including our religion and a photograph. Religion is obvious: “Yids Among Us” would obviously be one of the go-to tools of choice for Neo-Nazis. As for skin colour, ethnicity identification from face images is out there already. Want to go queer bashing? There’s an algorithm out there for guessing sexual orientation based on the network graph of the target’s facebook friends. It’s probably possible to apply this sort of data mining exercise to determine whether a woman has had an abortion or is pro-choice.

Chomsky, Norvig and Practical Machine Learning

Is it more important for science to describe how things behave or to explain why things behave the way they do? This seems to be the question behind statements Noam Chomsky made at a symposium regarding machine learning and an article by Peter Norvig discussing Chomsky’s opinions.

Chomsky is highly critical of the commonly used statistical models for machine learning that focus on the “how”-part. He discounts the practical success of these models as unimportant for the advancement of science. His main goal – as far as I understand it – is to find the principles on which language is based.

Norvig argues in favour of statistical models and purely descriptive research as something well worth pursuing and not as rare in the history of science as Chomsky claims it is. He cites several examples from Chomsky’s publications that show a lack of knowledge about the capabilities of machine learning algorithms. Norvig’s most important argument – to my understanding – is that our current statistical models perform better in describing the reality of language than our current explanatory models.

Regardless of the advantages of statistical models, explanatory models definitely are easier for humans to talk and think about. And however our brains really learn languages, rules are how we teach them. Beyond the scientific considerations Norvig’s article and Chomsky’s statements focus on, I find this aspect relevant for the practical use of machine learning as well. Many applications or services that rely on machine learning suffer in usability, because you cannot really understand why they do what they do.

Spam filters are a good example of this problem. Originally, programs like SpamAssassin only used explicit rules (e.g. does the text contain the word “viagra”) to determine whether a mail was spam. For each matching rule, a certain number of points is added to the mail’s spam score. The program usually adds a header that lists all the matching rules and the points each one contributed (e.g. DRUGS_ERECTILE=0.282). These scores are helpful when working out why some mails were not classified correctly.
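
To make the rule-and-points idea concrete, here is a minimal sketch in Python. It is not how SpamAssassin is implemented: only the rule name DRUGS_ERECTILE and its score of 0.282 come from the header quoted above, while the other rules, the threshold of 5.0 and the exact header layout are assumptions for illustration.

```python
import re

# (rule name, pattern, points). Only DRUGS_ERECTILE=0.282 is taken from the
# text; the HYPOTHETICAL_* rules and their scores are made up.
RULES = [
    ("DRUGS_ERECTILE", re.compile(r"viagra", re.IGNORECASE), 0.282),
    ("HYPOTHETICAL_FREE_MONEY", re.compile(r"free money", re.IGNORECASE), 1.5),
    ("HYPOTHETICAL_SHOUTING", re.compile(r"[A-Z]{10,}"), 0.8),
]


def score_mail(text):
    """Return the total score and the list of (rule, points) that matched."""
    hits = [(name, points) for name, pattern, points in RULES
            if pattern.search(text)]
    total = sum(points for _, points in hits)
    return total, hits


def status_header(text, threshold=5.0):
    """Build a header listing every matching rule (threshold is an assumption)."""
    total, hits = score_mail(text)
    verdict = "Yes" if total >= threshold else "No"
    tests = ",".join(f"{name}={points}" for name, points in hits)
    return f"X-Spam-Status: {verdict}, score={total:.3f} tests={tests}"


if __name__ == "__main__":
    print(status_header("Buy viagra now"))
```

Because every point in the total can be traced back to a named rule, a misclassified mail can be debugged by reading the header alone.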

Spam detection was, however, greatly improved by the introduction of Bayesian filters. These probabilistic filters are trained on corpora of mails marked as spam or not-spam and calculate the probability that a new mail is spam. In SpamAssassin, this results in a single, rather opaque score such as BAYES_99=3.5. The other descriptive scores are still there, but from a brief look through my recent spam, the Bayesian classifier contributes the most significant numbers for the mails that were correctly filtered.
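
As a rough sketch of what such a filter does, the following Python class implements a plain naive Bayes classifier with add-one smoothing. This is a textbook simplification, not the algorithm SpamAssassin actually uses; the class name, the whitespace tokenizer and the labels "spam" and "ham" are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lower-cased whitespace splitting is enough for a sketch.
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Hypothetical minimal filter; needs at least one mail of each label."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.mail_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.mail_counts[label] += 1

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_mails = sum(self.mail_counts.values())
        # Words never seen in training carry no evidence, so skip them.
        words = [w for w in tokenize(text) if w in vocab]
        log_scores = {}
        for label in ("spam", "ham"):
            # Log prior: the share of training mails with this label.
            score = math.log(self.mail_counts[label] / total_mails)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                # Add-one smoothing avoids a zero probability.
                score += math.log((count + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Normalise in log space to keep the arithmetic stable.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


if __name__ == "__main__":
    f = NaiveBayesSpamFilter()
    f.train("cheap viagra offer", "spam")
    f.train("meeting agenda tomorrow", "ham")
    print(f.spam_probability("cheap viagra"))
```

The result is a single number. Unlike the rule list, nothing in it tells you which words tipped the balance, which is exactly the opacity described above.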

The good news is that statistical models don’t always mean that you won’t get a good explanation. Amazon, for example, shows you why you get a certain recommendation. Below each recommended item, there is a link to “fix this recommendation”.

Fix recommendations on Amazon.com

The page you get shows the items you bought or looked at that the recommender considered similar to the one it is now recommending. The page also gives you ways to tell Amazon not to use these items for your recommendations in the future.
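
How Amazon computes this is not something I know, but the general idea of explaining a recommendation through similar items can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function names, the use of Jaccard similarity over the sets of buyers, and the sample purchase histories.

```python
def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def buyers_by_item(histories):
    """Invert {user: items} into {item: users who have that item}."""
    buyers = {}
    for user, items in histories.items():
        for item in items:
            buyers.setdefault(item, set()).add(user)
    return buyers


def explain_recommendation(histories, user, candidate,
                           ignored=frozenset(), top_n=2):
    """List the user's own items most similar to the recommended candidate.

    Items in `ignored` are left out, mimicking a "don't use this item" switch.
    """
    buyers = buyers_by_item(histories)
    candidate_buyers = buyers.get(candidate, set())
    scored = []
    for own in histories[user]:
        if own == candidate or own in ignored:
            continue
        similarity = jaccard(buyers[own], candidate_buyers)
        if similarity > 0:
            scored.append((similarity, own))
    # Highest similarity first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [own for _, own in scored[:top_n]]


if __name__ == "__main__":
    histories = {
        "ann": {"tent", "stove"},
        "bob": {"tent", "sleeping bag"},
        "cat": {"tent", "sleeping bag", "novel"},
    }
    print(explain_recommendation(histories, "ann", "sleeping bag"))
```

The explanation falls out of the same similarity scores that produced the recommendation, so showing it costs the application very little.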

In applications with a machine learning component, giving some explanation or reasoning goes a long way toward improving usability.

On the News

Mandy Brown wrote a really good post about the evolution of her news-reading behaviour. She started out reading one newspaper and then transitioned to reading many different online news sources. I can absolutely relate to that. I also share the change in when I read the news: I used to read the newspaper once in the morning during breakfast; now I’m checking the news all day.

And then Mandy talks about what she expects from the news she reads:

I want a reading experience that defends the news from the circus that online advertising creates. I want good storytelling and analysis, not naked facts. I want news that admits and defends its point of view (and acknowledges that there is a truth to be uncovered), not news that parrots the party line while making claims to objectivity. I want long essays on the events at Fukushima and the consequences for nuclear power going forward, not shrieking dispatches of each new fire or setback. I want a history of American engagement in Libya, putting the events of the past few weeks in context. I want twenty thousand words on the recession and its effects on the middle class, not another lone statistic about the unemployment rate. I want thoughtful, investigative journalism that exposes the ways in which our government is failing us, so that we can make it better.

I can say ‘yes’ to each of the points, except for the second item. I do want good storytelling and analysis, but I also want links to the naked facts. When I read about a planned increase in taxes on diesel fuel, where one quoted expert claims that favouring diesel is bad from an environmental point of view while car manufacturers claim that diesel is good for the environment, I want links to studies that compare the environmental properties of diesel and gasoline. When I read an op-ed about health care, where the author claims that the largest part of a person’s health care costs is incurred in the final year of their life no matter how long the person lived, I want a link to a statistic supporting the claim. Bonus points for additional links to stats that do not agree and an explanation of why they are less credible.

I have another addition to this wish list while we’re at it: I do want news that admits and defends its point of view, and I also want pointers to articles with different points of view.

Link: Punk Rock Languages

Chris Adamson wrote a polemic about programming languages for the March 2011 issue of PragPub magazine that is both entertaining and thought-provoking.

The natural appeal of the language is to write software with it, not to mess with the language itself—Solve your users’ problems rather than indulging your own programming fetishes.

Link: Presenting Like a Hacker

After blogging like a hacker with Jekyll by Tom Preston-Werner, I recently came across a similar thing for presentations: Showoff by Scott Chacon. With it you can create presentations in Markdown and show them in a browser. This is particularly useful when you have code in your presentations, which is a real pain in other applications.