7/07/2006

Big Brother - Part III

Back to Big Brother: is there something that could be called Big Brother on the Internet?
Yes, there is.
It's not what you're thinking of; there's no such thing as a huge database with all our names and a log of each and every session we open on the net. At least I hope not. But there are other things that, in the common user's perception, are almost as scary as that.
Right now you're under my scope. This blog has a log feature (courtesy of eponym) like any other web server. This log tells me where you come from, your IP address, what browser you're using, the timestamp, what you requested, whether you clicked on a link and where that link was, etc. It doesn't say much about you yourself. I wouldn't be able to follow your steps unless you used exactly the same browser from the same IP, and even then I couldn't be sure it's you all the time. The idea of this log is to help the owner administer the site, check resource requirements, adjust the design of the page to serve all the different browsers, etc.
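To give an idea of what one log entry holds, here's a minimal sketch in Python. It assumes the widely used Combined Log Format; the real format of this blog's log may differ, and the sample line (address, page and referring link) is invented for illustration.

```python
import re

# Combined Log Format:
# ip identity user [timestamp] "request" status size "referer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Invented sample entry (192.0.2.x is an address range reserved for examples).
SAMPLE_LINE = (
    '192.0.2.10 - - [07/Jul/2006:10:15:32 -0500] '
    '"GET /index.htm HTTP/1.1" 200 5120 '
    '"http://www.example.com/links.html" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"'
)


def parse_log_line(line):
    """Return the fields of one log entry as a dict, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


entry = parse_log_line(SAMPLE_LINE)
for field, value in entry.items():
    print(f"{field:10} {value}")
```

Notice that nothing in the entry says who you are; it's an address, a browser and a trail of requests.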
But let's say that I give you an option to "improve your reading experience", something like choosing your own font or your own background color. Unless I can identify you, you'll have to repeat the choice every single time. One way around this would be to make you open an account and save your preferences there. The other is a "cookie".
A "cookie" is a piece of data related to a site that is stored on your computer. The cookie allows the server to recognize you from request to request, remember your preferences and follow your steps.
But before going further into this, there's something you have to understand about web servers. Let's say that you log into your webmail or go to a news web page. You spend some time in there, and you call that a session. In the case of the webmail the server knows you because you said who you are; in the case of the news page it doesn't. But in both cases you notice that the service is oriented to you. Your webmail always shows your inbox with your messages and sends mail in your name. The news page always shows the headlines related to the topics you chose previously. You don't have to identify yourself or repeat your choices every single time you open a new page.
However, from the technical point of view, a web session is a request for one element and one element only. When you open this page, you send a request to the server for the index.htm document. The server sends you that file and you close the session (your browser does). The index.htm file is just the text and the format of the page; you can check it out with the View Source option in your browser. Once your browser has the HTML file, it starts asking for the elements required to build the page for you. The images, JavaScript files, multimedia files, etc. are all referenced in the HTML file and requested from the server one by one, in different sessions. By session I mean a TCP/IP session: your browser opens the session, the server acknowledges it, your browser sends the message requesting one element, the server sends the element, and your browser closes the session. (Newer versions of HTTP can keep one connection open for several requests, but each request is still handled on its own.) Up to this point, this is what the HTTP protocol does, no more, no less.
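To make the "one by one" part concrete, here's a rough sketch of how a browser could find the extra elements it has to request after reading the HTML. It uses Python's standard html.parser module; the sample page and its file names are made up, and a real browser looks at many more tags than these three.

```python
from html.parser import HTMLParser


class ResourceFinder(HTMLParser):
    """Collect the URLs of elements that must be requested after the HTML itself."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Images and scripts name their file in "src", stylesheets in "href".
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.resources.append(attrs["href"])


# Invented page, standing in for an index.htm file.
SAMPLE_PAGE = """
<html><head>
<link rel="stylesheet" href="style.css">
<script src="menu.js"></script>
</head><body>
<img src="header.gif">
<p>Some text</p>
<img src="http://ads.example.net/banner.gif">
</body></html>
"""

finder = ResourceFinder()
finder.feed(SAMPLE_PAGE)
for url in finder.resources:
    print("next request:", url)
```

Each URL printed here becomes a separate request, and nothing in the requests themselves ties them together.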
The protocol itself has no way to know that it's you through all those sessions, and for the most plain and simple pages it has no need to. Like this page: any request for any element will be served exactly the same regardless of the client. But if you're using your webmail or your bank account, the server needs to know who you are in order to build a page with the information relevant to you.
The cookie does that: the server creates a virtual session, assigns a code to it and sends it to your browser in a cookie. Every time your browser sends a request to the server, it sends the cookie too. The server knows that that particular cookie was generated and sent to you at the time you identified yourself, hence any request bearing that cookie must have come from you.
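Here's a minimal sketch of that exchange, assuming a server that keeps its sessions in a plain dictionary. The cookie name session_id, the user name and the helper functions are all hypothetical; real servers add expiry, encryption and other safeguards on top of this.

```python
import secrets
from http.cookies import SimpleCookie

sessions = {}  # server side: session code -> user name


def login(user_name):
    """Create a virtual session and return the value of the Set-Cookie header."""
    session_id = secrets.token_hex(16)  # random code, hard to guess
    sessions[session_id] = user_name
    cookie = SimpleCookie()
    cookie["session_id"] = session_id
    return cookie["session_id"].OutputString()


def identify(cookie_header):
    """Work out who sent a request from its Cookie header."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("session_id")
    return sessions.get(morsel.value) if morsel else None


set_cookie_value = login("alice")      # server -> browser: Set-Cookie: session_id=...
cookie_header = set_cookie_value       # browser -> server, on every later request
print(identify(cookie_header))         # prints alice
print(identify("session_id=unknown"))  # prints None
```

The server never asks "who are you?" again; it only checks whether the code in the cookie is one it handed out.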
That's a session cookie; it's good only during that session. The cookie is created with a short lifespan, on the order of hours or minutes, and should be discarded when the browser closes.
There are also persistent cookies: cookies with a long lifespan, even beyond reasonable limits, that we can call eternal for practical purposes. Those cookies are the ones used to "improve your browsing experience". They can store your site preferences, but most of the time the cookie holds just one code, an ID code. The server stores your preferences and links that set to the ID code sent to you in the cookie. From that point on, all your requests include the cookie, and the server looks up your preferences and personalizes the page for you. Nice, isn't it?
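The difference between the two kinds shows up in the header the server sends. This is only a sketch, using Python's standard http.cookies module; the cookie names, the ten-year lifespan and the stored preferences are assumptions chosen for illustration.

```python
from http.cookies import SimpleCookie


def session_cookie(session_id):
    """No lifespan attribute: the browser should drop the cookie when it closes."""
    cookie = SimpleCookie()
    cookie["session_id"] = session_id
    return cookie["session_id"].OutputString()


def persistent_cookie(visitor_id, years=10):
    """A long Max-Age keeps the cookie on disk across browser restarts."""
    cookie = SimpleCookie()
    cookie["visitor_id"] = visitor_id
    cookie["visitor_id"]["max-age"] = years * 365 * 24 * 60 * 60  # in seconds
    return cookie["visitor_id"].OutputString()


# Server side: the preferences are stored against the ID, not inside the cookie.
preferences = {"v-1001": {"font": "Verdana", "background": "white"}}

print("Set-Cookie:", session_cookie("abc123"))
print("Set-Cookie:", persistent_cookie("v-1001"))
print("stored for v-1001:", preferences["v-1001"])
```

The cookie itself carries only the ID; everything the server knows about you stays on the server.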
Also, the server is now able to track your steps from session to session. Let's say that you visit your favorite bookstore and spend some time looking for books about gardening. Then, on your next visit, half of the books highlighted on the front page are about gardening. Have they read your mind? Is this a case of Jung's synchronicity? Of course not: your browser now has a cookie, and your cookie has been linked to many search requests for "gardening". The server does it to improve your "shopping experience". And to make it more likely that you buy a book.
Now this seems intrusive: they're really tracking your every step, what you look for, what you're into. Yes, it's true. But unless you open an account with them and identify yourself with your real information, they have no way to know who you are. And, most likely, they don't care.
Is that so bad?
I bet that there's at least one store where you drop by frequently. A coffee shop on the way to work, a deli, a drugstore, a tobacco store. If there is, chances are that most of the time you're served before you even order. The person serving you recognizes your "cookie" (you, the real you) and has it linked to your preferred mochaccino, sandwich or cigarette brand. We don't see this as intrusive. And yet a stranger is aware of our preferences: where we buy, when we buy, what we buy.
The difference is in our minds. The desk clerk is human, the server is not, and we have a natural inclination to trust humans and distrust machines. On the other hand, we won't pick up a porn magazine in front of a human clerk, but we'll take it from the server that we distrust.
I know, it's not easy to understand. But the human mind is too complex to be explained in this blog.
Moving forward with Internet and privacy.
So far, we've been through some of the ways a server can look over our shoulders. None seems to be really scary. Even a persistent cookie looks harmless: it doesn't carry our identity and it's limited to the server that issued it. And you have many ways to avoid them.
In your browser settings there's an option to set policies for cookies. The options change from one brand or version to another, but basically they are whether to accept cookies or not, what to do with them, and a list of servers that get specific treatment.
Nowadays, a policy to reject all cookies is a bad idea, since most sites involving long sessions, like webmail or shopping sites, rely on cookies to operate. So at the very least you have to allow session cookies; optionally, you can allow them only for the sites you use. You can also keep a list of sites that you want to remember your preferences. Then, for all the rest, you either block them, set a policy to delete all cookies when the browser closes, or go to your browser settings and delete them yourself.
That's a good set of policies if you're worried about cookies. I prefer to delete them myself, but not from my browser settings, I go to the cookies directory and take them all out.
So you can do the same: go to your Documents and Settings directory; there must be a directory with your profile name and, inside it, a Cookies directory. This is if you're using Internet Explorer on Windows 2000 or XP; other operating systems and browsers may have their own separate directory. Anyway, you'll find a list of files, most likely named your_name@some_domain. Each file is a cookie related to that particular domain, so you'll find there a list of some places you visited and some you didn't. Yes, you've read that right: some that you THINK you didn't. I'm sure that you've never been to 2o7 or doubleclick or zedo or webtrends, and the list is a lot longer than this.
Now you must be wondering how this happened. I said that you get a cookie when you visit a site, that you only get a cookie related to that particular site, and that your browser sends cookies only to the site they belong to. And all of this is completely true, at least I hope so. The answer is that you visit a huge number of sites without knowing it.
Let's go back for a second to the HTTP session. When you ask for a page, the server sends you the first element, the file of the page itself. It contains all (or most of) the text, the formatting information and the references to all the other objects. But those objects may or may not be on the same server. So you get your page from server A, and the HTML text says that an image is required and that it's located at server B. Your browser opens a session with server B, exchanges cookies if needed, and gets the image. Meanwhile, you've visited a site you didn't explicitly ask for.
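As a rough illustration, this sketch takes the address of a page and the list of elements it refers to, and tells you which other servers your browser ends up contacting. All the host names are invented, and comparing full host names is a simplification of what browsers actually do.

```python
from urllib.parse import urljoin, urlparse


def third_party_hosts(page_url, resource_urls):
    """Return the hosts, other than the page's own, that the browser must contact."""
    page_host = urlparse(page_url).hostname
    hosts = set()
    for url in resource_urls:
        # Relative references ("logo.gif") resolve against the page's own address.
        host = urlparse(urljoin(page_url, url)).hostname
        if host and host != page_host:
            hosts.add(host)
    return sorted(hosts)


# Invented example: the page comes from server A, two elements live elsewhere.
PAGE_URL = "http://www.server-a.example/index.htm"
RESOURCES = [
    "logo.gif",
    "/css/site.css",
    "http://images.server-b.example/photo.jpg",
    "http://ads.tracker.example/banner.gif",
]

for host in third_party_hosts(PAGE_URL, RESOURCES):
    print("also visited:", host)
```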
This is not against any rules; it's totally normal, although unexpected for the common user. Some of these links are used just because the page requires that element from another server. For example, some forum pages don't allow users to store avatar images on the server. You have to store yours somewhere else and configure the link in your profile. Every time a page has to show your avatar, it includes the link to the server you designated for that. These cases most likely don't use a cookie.
Most of the links that use cookies are advertising: pages that have contracts with doubleclick or zedo are paid for placing a link on their pages. Every time you request a page, one or more requests are sent to the advertising server for the elements required to complete the page. Those elements may be always the same, or changed frequently, or rotated among a group of ads. Those servers need to keep track of each and every request in order to show results to their clients and pay the page owners. They set cookies for many reasons: they want to know how many different people were exposed to each ad, they want you to see as many different ads as possible and, if you clicked one, they want to send you the ads that you're most likely to click.
Remember that one rule of cookies is that they're only related to one site? They are. The cookies from ad server A are and will be exchanged only with server A. The problem is that server A is being referred to from sites B, C and D, the sites you're visiting. Now server A can tell when and where you visit each of these sites; if you pick an ad on B, they'll send you related ads when you visit C and D.
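A toy model shows how little it takes. This is a sketch, not how any real ad server is built: the AdServer class, the visitor IDs and the site names are all hypothetical.

```python
import itertools
from collections import defaultdict


class AdServer:
    """Toy ad server A: one cookie per browser, one history per cookie."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.history = defaultdict(list)  # cookie -> sites where an ad was shown

    def serve_ad(self, cookie, referring_site):
        """Handle one ad request and return the cookie the browser should keep."""
        if cookie is None:  # first contact: hand out a new ID
            cookie = f"visitor-{next(self._ids)}"
        self.history[cookie].append(referring_site)
        return cookie


ad_server = AdServer()
browser_cookie = None  # the browser starts without a cookie for server A

# The user visits B, C and D; each page embeds an ad from server A.
for site in ["site-b.example", "site-c.example", "site-d.example"]:
    browser_cookie = ad_server.serve_ad(browser_cookie, site)

print(browser_cookie, ad_server.history[browser_cookie])
```

The cookie only ever travels between the browser and server A, yet server A ends up with the list of sites.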
This is targeted marketing, and I doubt they use it for any other evil purpose. In fact, most of them just control the number of exposures for each ad, balancing diversity and quotas, showing each user as many different ads as possible and reaching the goals required by each paying advertiser. The selection of topics is done beforehand: porn ads on porn sites, foods and wines on epicurean sites, etc.
Google does this topic analysis for its AdSense program. The topics are chosen based on the statement of the site owner who subscribes to the program, but also on the content. It's not very accurate. Suppose that you have a site about the red lobster of the south Pacific (I have no idea if such a thing exists), and you're trying to bring awareness to the general public about this creature in danger of extinction due to excessive fishing and habitat degradation by human activities. AdSense could fill your site with ads about lobster restaurants, fresh lobster on sale and lobster recipes. But taking into account the huge number of ads shown every minute, the results are good. Otherwise, people wouldn't pay for it or take it for their sites.
I don't know if Google is doing what I'm about to mention; if it starts to do it, I hope they send some money my way. The system gets more accurate as more users choose ads. In the lobster case most users would ignore the ads, making them less likely to be reassigned to that site. On other sites, where the ads match the content of the site and the interests of the visitors, the click rate is high, making them more likely to be assigned to that site and to others with related content or linked from there.
I don't like ad-laden sites where you have to dig for the content you're looking for, not to mention those sites that are all ads and no content. But at some point I have to compromise. I like the idea of having free web sites with content I can use: news, recipes, instructions of any kind, reading material. The owners of the sites need an incentive to keep doing it, and money is THE incentive. Web sites with ads are a good thing because the ads keep those sites free for everyone. However, small sites don't have the mass of visitors required to negotiate with advertisers directly. Ad servers filled that gap by dealing with a large number of sites that together can provide that mass of visitors for the advertiser.
The last group of the unknown cookies in your directory (and mine) is the scariest of all. This is the one we can call Big Brother. I know for sure that you have at least one 2o7 cookie. And the reason I know that is that almost all of the most popular sites have links to it. The owner of those cookies is a company called Omniture, probably the biggest of its kind but not the only one. Omniture does statistical analysis. They basically count every single time one of their links is requested and relate it to the connected cookie. Each time a link is requested, they know if you have one or more of their cookies (if not, they send you one right away), what page you've just opened, the time of the request, the server that served that page, your browser brand, some of the basic options you have set and some other minor information. This information doesn't seem to be valuable at first; it doesn't include your identification and I don't think they really care about it. But if you put it together with all the millions of other little bits of information, things look very different. Of course, it takes talent to extract valuable data from such a huge pile of bits, and Omniture seems to have it, being the most successful in its class.
Evil as it may seem, there's nothing wrong with it. Let me rephrase that: I can think of a thousand reasons why it's wrong to do that, but not one related to the privacy of the users. The owners of the sites have the right to know at least how many times their pages are visited; they even have the right to know who is reading their pages. Some do, and require you to register and ask for your name, your address, your phone number. Some go even further and request evidence of your identity to register. But it's your choice to do so. Once you voluntarily access a site, they own that bit of information about you.
On the practical side of the matter, your identity means nothing. There's no sense or need to know who you are. Statistics and statistical correlation have no meaning unless the number of events measured is huge.
Let's say that you have a die. You know that the odds of getting a certain number in a throw are 1 in 6, one sixth. You assume that all of the numbers have the same probability. You throw it once, and the probability of getting any number is still the same for the next throw. It's tempting to think that the number you got on the first throw should now be slightly less probable because, in the long run, all the numbers should appear about the same number of times. It isn't; the die has no memory. It sounds like a paradox but it isn't: the even distribution after a large number of events is a consequence of those events having the same probability, not of the die making up for earlier results. The key here is the large number of events because, as any Yahtzee player knows, rolling the same number many times in a row is possible. But if you roll the same die six thousand times, you should get each number about one thousand times. A deviation of a few percent is expected (the typical spread for each number is around 30 throws, about 3 percent), but if one number is off by something like 10 percent or more, you'd better get that die checked.
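You can try the six thousand throws without wearing out your wrist. This is a small simulation sketch in Python; the seed is an arbitrary choice so the run can be repeated, and the "typical spread" is the standard deviation of a binomial count with probability one sixth.

```python
import math
import random
from collections import Counter

ROLLS = 6000
rng = random.Random(2006)  # arbitrary fixed seed, so the run is repeatable

# Throw a fair die ROLLS times and count how often each face comes up.
counts = Counter(rng.randint(1, 6) for _ in range(ROLLS))

expected = ROLLS / 6
# Standard deviation of one face's count: sqrt(n * p * (1 - p)) with p = 1/6.
sigma = math.sqrt(ROLLS * (1 / 6) * (5 / 6))

for face in range(1, 7):
    deviation = (counts[face] - expected) / expected * 100
    print(f"face {face}: {counts[face]:5d} times ({deviation:+.1f}%)")
print(f"expected {expected:.0f} per face, typical spread about {sigma:.0f}")
```

Change ROLLS to 60 and run it again: the percentages jump all over the place, which is the whole point about large numbers.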
Statistical analysis is based on this. Human behavior can't be calculated in terms of probability, at least not beforehand. But if you measure some event a large number of times, you can infer the probability from there.
I'll give you one example of correlation. Imagine a graph showing the age of people against a list of sites that those people visited over a period of time. After you plot the first 10 points, that's what you've got: 10 points scattered across the graph. As the number increases, you can start to see trends, or see that there are none. A site with an even distribution of points along the whole age scale has no correlation with it, meaning that age is not a factor for that particular site. If a site is more popular among people of a certain age, that part of its line has a higher density of points. And the same happens going across the age scale: the sites more popular in each age segment have a higher density of points there.
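Here's a sketch of that graph in text form, counting the points for each site in 10-year age bands. The visits and site names are invented, and with so few points the picture only hints at what a large sample would show.

```python
from collections import Counter

# Invented (age, site) observations; each pair is one point on the graph.
visits = [
    (15, "games.example"), (17, "games.example"), (19, "games.example"),
    (22, "games.example"), (24, "games.example"), (41, "games.example"),
    (16, "news.example"), (28, "news.example"), (35, "news.example"),
    (47, "news.example"), (52, "news.example"), (63, "news.example"),
]


def density_by_decade(points, site):
    """Count the points for one site in each 10-year age band."""
    bands = Counter((age // 10) * 10 for age, s in points if s == site)
    return dict(sorted(bands.items()))


for site in ("games.example", "news.example"):
    print(site)
    for band, count in density_by_decade(visits, site).items():
        print(f"  {band}-{band + 9}: {'*' * count}")
```

In this made-up data the games site piles up in the younger bands while the news site is spread evenly, so age matters for one and not for the other.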
Not so many years ago, statistical correlation wasn't so popular, just because it wasn't easy to get a large number of measurements to analyze. Of course some statistical analysis was done, but in most cases the numbers were not big enough to make the analysis accurate.
The Internet changed that. Not only can you get millions of millions of measurements, you can get millions of different events. Even more, you can link different events to the same person. It doesn't matter who he or she is; what's important is that those events are related to the same person. And, best of all, the collection of data is done automatically.

As you can see, someone's looking over your shoulder while you surf around the Internet. I think that marketing is evil, and this kind of marketing is even worse than evil. But not because our personal privacy is being violated (I don't think it is); it's because our collective privacy is being violated. We, as a human group, are being closely watched, scrutinized and dissected. But I won't complain; I still feel that we're far away from 1984.

One last comment about Omniture. If you go to 2o7.net, you'll get to a page where Omniture briefly explains the meaning of all those links pointing at 2o7.net that you find on other sites' pages. Don't expect an apology. They do this on behalf of their customers, the web sites, so it's up to you to check the privacy policy of each site. And they're right.
The funny thing is that at the end of the page they have a link that allows you to opt out of the system. If you don't want to be watched by them you just have to click there... and get another cookie.
