Original post on Zemanta’s blog, reproduced here for posterity: It’s every advertiser’s worst nightmare: advertising on a seemingly legitimate site only to realize that the traffic and/or clicks from that site are not the coveted genuine human interest in the ad. Instead, they find fake clicks, unintentional traffic or just plain bots. In a never-ending quest for more ad revenue, website publishers scramble for ways to impersonate their more successful counterparts. However, not all approaches are as respectable as improving readability and SEO. One pernicious tactic is sharing traffic between two or more sites. Of course, almost all websites share some of their visitors, but this percentage is small. Moreover, as a site accumulates more visitors, the probability of a large overlap occurring by chance becomes infinitesimal. This tactic is commonly used by botnets, so the sites receiving this traffic can also be unwitting targets of such schemes. For example, a botnet can add several well-known and respected websites to its list of suspicious sites, so that the apparent credibility of the malicious sites is artificially boosted. The question is thus: can we identify these traffic-sharing websites? And if so, how? The answer to the first question is yes, and the answer to the second is this blog post. Our problem lends itself nicely to a network approach called a covisitation graph. We will construct a graph such that the sites that share traffic will be tightly connected, especially if visitors are shared between several sites, as is usually the case. We can […]
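To make the covisitation idea concrete, here is a minimal sketch rather than the method from the original post. It assumes we have a set of visitor IDs per site, links two sites when their overlap coefficient (shared visitors divided by the smaller site's audience) reaches a threshold, and then reads off connected groups. The function names, the overlap measure, the 0.5 threshold and the toy data are all assumptions made for illustration.

```python
from itertools import combinations


def covisitation_edges(visitors_by_site, threshold=0.5):
    """Link pairs of sites whose visitor overlap reaches the threshold.

    Overlap here is |A & B| / min(|A|, |B|); this measure and the default
    threshold are illustrative assumptions, not taken from the post.
    """
    edges = {}
    for a, b in combinations(sorted(visitors_by_site), 2):
        va, vb = visitors_by_site[a], visitors_by_site[b]
        smaller = min(len(va), len(vb))
        if smaller == 0:
            continue  # a site with no visitors cannot share traffic
        overlap = len(va & vb) / smaller
        if overlap >= threshold:
            edges[(a, b)] = overlap
    return edges


def connected_groups(edges):
    """Return the connected components of the thresholded graph."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk over one component
            site = stack.pop()
            if site in group:
                continue
            group.add(site)
            stack.extend(neighbours[site] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical toy data: three sites fed by the same small pool of visitors,
# plus two unrelated sites that share a single visitor by chance.
visits = {
    "news.example": {1, 2, 3, 4, 5, 6, 7, 8},
    "blog.example": {1, 9, 10, 11},
    "shady-a.example": {101, 102, 103, 104},
    "shady-b.example": {101, 102, 103, 105},
    "shady-c.example": {102, 103, 104, 105},
}

edges = covisitation_edges(visits, threshold=0.5)
groups = connected_groups(edges)
print(groups)
```

On this toy data only the three "shady" sites end up linked, because each pair shares three of four visitors, while the two ordinary sites share just one. A sketch like this does not handle the case mentioned above where a botnet also visits respected sites; separating those would need something beyond a single threshold.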

I recently finished a 10-week research visit at Stanford, working under Prof. Jure Leskovec. Here’s a short summary of my visit, and you’ll soon be able to read more about the research I did there. You can also check out my Facebook photo album. After settling in on campus in the graduate residences, I went around to look at everything Stanford has to offer. Its architecture is unified and gives it a very distinguished look. Some of its most beautiful buildings are the Huang Engineering Quad, the Oval, the church, and the main Quad. One of our main excursions was the visit to San Francisco. Vid, Jose, Klemen Kotar, and I gathered at Uber headquarters for a workshop, after which we toured the city. We walked along the famous Market St with its abundance of skyscrapers, through the financial district, and along the coast on Embarcadero St, passing by the seals. From there we could see the infamous Alcatraz and even get a glimpse of the Golden Gate Bridge from afar, shrouded in the characteristic San Franciscan fog that envelops the tall buildings even at midday. We also visited Lombard St, the “crookedest” street in SF, as well as the Ghirardelli chocolate factory, where we got some free chocolate! The week after, Vid and I went with France and Mia Rode to a picnic at Twin Pines Park in Belmont, where many 1st, 2nd or 3rd generation Slovenians gathered for an afternoon of pleasant company and good food. […]

My first journal submission just got accepted! It is an improved and polished final version of the segmentation work I presented at ERK last year. The arXiv preprint is available for now, but the final version will be published when the paper appears in this year’s Elektrotehniški vestnik (Journal of Electrical Engineering and Computer Science). The main differences between this version and the previous conference paper are improved accuracy, the addition of different pre-processing algorithms, and a more polished method overall.

Data Scientist: The Sexiest Job of the 21st Century Now that I got your attention… It seems like everyone and their manager wants a data scientist in their company to boost profits and use #bigdata, yet there does not seem to be a good definition of what a data scientist is supposed to do or even what kind of knowledge and expertise he/she must possess. From Drew Conway’s famous Venn diagram, which probably oversimplifies things, to the recent lengthy discussion on CrossValidated, the aptly named Stack Exchange site for statisticians, which probably overcomplicates it, there have been many attempts, so I will not try to present a succinct yet encompassing definition that is just going to get lost in the sea of failed ones. But we can at least enumerate the plethora of interdisciplinary skills that data scientists are expected to have. The degree requirements alone showcase the versatility of this position: a degree in any of Computer Science, Statistics, Applied Math, Physics, Engineering, or basically any quantitative field. On top of this, the degree can be a BSc, MSc or PhD in any of these areas. Now, turning to the skills, we can split them into a few broad areas of expertise, and the more of them a candidate possesses, the better. So basically, you’re expected to be familiar with every concept described below. Computer Science: R & Python – You want a scripting language for fast prototyping, and these two are equipped with excellent data manipulation (numpy, pandas) and visualization (ggplot, matplotlib), in addition to machine […]

Dean’s commendation for academic success

This year I received an award given out to the students with the highest grade average of the past year. I was honored to receive the award, of course, but I was also delighted about the book I got (where they put the certificate as the first page). In an incredible coincidence, the book awarded, Richard Dawkins’s “The Selfish Gene”, was exactly the book I was reading on my e-book reader at the time. Although I was already halfway through it, it was a joy to have the rare experience of reading a book in paper format. A full review of the book will be coming soon.

I presented a short paper on machine learning algorithms at this year’s Information Society multiconference. It was a continuation of a project for my Machine Learning course. Prof. Bosnić and I looked at which feature selection techniques and which machine learning algorithms work best for gene microarray data, which has very few observations and many features (genes). The most interesting finding was that genes that were predictive of one cancer were also predictive on other data sets with different types of cancer. Our paper can be found in the proceedings under the Intelligent Systems section (Volume A, pp. 17). More info at my research page.

A bit overdue, but (the bulk of) the work I did for my thesis was finally presented at this year’s IEEE ERK (Electrotechnical and Computer Science) conference in Portorož, Slovenia. I did my bachelor’s thesis on unsupervised image segmentation under Prof. Matej Kristan of the Visual Cognitive Systems Laboratory at FRI. After improving our approach, we decided to present it at ERK. I was a bit nervous since this was my first oral presentation at a conference, but it went great and the discussion was helpful and interesting. You may find our paper, “A regularization-based approach for unsupervised image segmentation” (Aleksandar Dimitriev and Matej Kristan, both FRI, Uni Lj), under the Pattern Recognition section. More details, as well as my other research, can be found at my research page.