Sunday, February 23rd

Recommender solution on Heroku

This is an out-of-the-box solution to implement a recommender system on Heroku.

Require.js presentation

This is a very interesting and very clear presentation about require.js and the AMD principle.

Giraph and Okapi

Apache Giraph ‘is an iterative graph processing system built for high scalability’, and Okapi is a very promising set of tools for large-scale machine learning and graph analytics based on Giraph.

Findings of the last weeks

Posted in Apache Giraph, Graph, Okapi, Python, Uncategorized

Monday, February 3rd


The swirl R package

The swirl R package is designed to teach you statistics and R simultaneously and interactively. If you are new to R, have no fear: following four simple steps you can start to enjoy R. First, get R and install it on your system; second, it is recommended to install RStudio to make your experience with R much more enjoyable. Once installed, open R and get the swirl package by executing install.packages("swirl"); then you can start your journey with library("swirl").

F11 Introduction to Computational Linguistics – LIN386M Home

Introduction to Computational Linguistics introduces the most important data structures and algorithmic techniques underlying computational linguistics: regular expressions and finite-state methods, categorial grammars and parsing, feature structures and unification, meaning representations and compositional semantics.

MySQL Utilities

MySQL Utilities is both a set of command-line utilities and a Python library that makes common tasks easy to accomplish. The library is written entirely in Python, meaning that it is not necessary to have any other tools or libraries installed to make it work. It is currently designed to work with Python v2.6 or later, and there is no support (yet) for Python v3.1.

80 resources for learning D3.js

These are some books, tutorials, screencast videos and courses for learning D3.js, the wildly popular JavaScript library for manipulating documents based on data, created by the genius that is Mike Bostock. And also take a look at this link D3.js 101. Last but not least this is a huge list of D3.js examples.


Posted in Uncategorized

Saturday, January 4th

Differences between inner class and nested static class in Java

Both static and non-static nested classes (inner classes) need to be declared inside an enclosing class in Java, which is why they are collectively known as nested classes, but they have a couple of differences, as shown below:
1) The first and most important difference is that an inner class requires an instance of the outer class for initialization, and it is always associated with an instance of the enclosing class. A nested static class, on the other hand, is not associated with any instance of the enclosing class.
2) Another difference is that the latter uses the static keyword in its class declaration, which means it is a static member of the class and can be accessed like any other static member.
3) A nested static class can be imported using static import in Java.
4) One last difference is that the latter is more convenient and should be preferred over inner classes when declaring member classes.

How does DHCP work?

Have you ever asked yourself this question? Here is a quick, non-exhaustive answer (thanks to this).

Schema of a typical DHCP session

Client DISCOVERY (broadcast) -->
<-- Server OFFER (unicast)
Client REQUEST (broadcast) -->
<-- Server ACKNOWLEDGE (unicast)

DHCP uses the same two IANA assigned ports as BOOTP: 67/udp for the server side, and 68/udp for the client side.
DHCP operations fall into four basic phases. These phases are IP lease request, IP lease offer, IP lease selection, and IP lease acknowledgement.
After the client has obtained an IP address, it may start an address resolution (ARP) query to prevent IP conflicts caused by overlapping address pools of DHCP servers.

DHCP discovery

The client broadcasts on the local physical subnet to find available servers. A client can also request its last-known IP address. If the client is still in a network where this IP is valid, the server might grant the request. Otherwise, it depends on whether the server is set up as authoritative or not. An authoritative server will deny the request, making the client ask for a new IP immediately. A non-authoritative server simply ignores the request, leading to an implementation-dependent timeout after which the client gives up on the request and asks for a new IP.

DHCP offers

When a DHCP server receives an IP lease request from a client, it extends an IP lease offer. This is done by reserving an IP address for the client and sending a DHCPOFFER message across the network to the client. This message contains the client’s MAC address, followed by the IP address that the server is offering, the subnet mask, the lease duration, and the IP address of the DHCP server making the offer. The server determines the configuration based on the client’s hardware address, as specified in the CHADDR field.

DHCP requests

When the client PC receives an IP lease offer, it must tell all the other DHCP servers that it has accepted an offer. To do this, the client broadcasts a DHCPREQUEST message containing the IP address of the server that made the offer. When the other DHCP servers receive this message, they withdraw any offers that they might have made to the client. They then return the address that they had reserved for the client back to the pool of valid addresses that they can offer to another computer. Any number of DHCP servers can respond to an IP lease request, but the client can only accept one offer per network interface card.

DHCP acknowledgement

When the DHCP server receives the DHCPREQUEST message from the client, it initiates the final phase of the configuration process. This acknowledgement phase involves sending a DHCPACK packet to the client. This packet includes the lease duration and any other configuration information that the client might have requested. At this point, the TCP/IP configuration process is complete.
The server acknowledges the request and sends the acknowledgement to the client. The system as a whole expects the client to configure its network interface with the supplied options.
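The four phases above (discover, offer, request, acknowledge) can be sketched as a toy exchange between a client and two servers. This is only an illustration of the message flow with made-up class names, not a real DHCP implementation (a real client broadcasts UDP packets on ports 67/68):

```python
# Toy simulation of the DHCP "DORA" handshake (illustrative only).

class DhcpServer:
    def __init__(self, name, pool):
        self.name = name
        self.pool = list(pool)      # available addresses
        self.reserved = {}          # client MAC -> offered address

    def offer(self, mac):
        """DHCPOFFER: reserve an address for the client, if any is free."""
        if not self.pool:
            return None
        ip = self.pool.pop(0)
        self.reserved[mac] = ip
        return ip

    def handle_request(self, mac, chosen_server):
        """DHCPREQUEST is broadcast: the chosen server ACKs, the losing
        servers return the reserved address to their pool."""
        if chosen_server is self:
            return ("ACK", self.reserved.pop(mac))
        if mac in self.reserved:
            self.pool.append(self.reserved.pop(mac))
        return None

def dora(client_mac, servers):
    # DISCOVER (broadcast): every server may answer with an OFFER.
    offers = [(s, s.offer(client_mac)) for s in servers]
    offers = [(s, ip) for s, ip in offers if ip is not None]
    chosen = offers[0][0]                   # the client picks one offer
    # REQUEST (broadcast): all servers see it; only the chosen one ACKs.
    acks = [s.handle_request(client_mac, chosen) for s in servers]
    return next(r[1] for r in acks if r)    # the acknowledged lease

s1 = DhcpServer("s1", ["192.168.0.10", "192.168.0.11"])
s2 = DhcpServer("s2", ["10.0.0.10"])
lease = dora("aa:bb:cc:dd:ee:ff", [s1, s2])
print(lease)    # the client ends up with exactly one lease
print(s2.pool)  # the losing server got its reserved address back
```

Note how the broadcast REQUEST is what lets the losing server reclaim its reservation, exactly as described above.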

Posted in Uncategorized

Thursday, January 2nd

Scaling Pinterest

This is an InfoQ video where Marty Weiner and Yash Nelapati talk about the decisions they took during their journey from the beginning up to now. I found it very interesting because they highlight some concepts that have a real and relevant impact despite their apparent triviality.

Apache Mesos

Watching another video at InfoQ, Apache Mesos was mentioned, so let’s take a quick look. As stated on the site, Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes. It is a distributed computing platform, or we could think of it as a sort of distributed OS. It implements a master/slave architecture and has the following components:

  • Master(s): one master is elected (via a ZooKeeper cluster) among the available masters. The master doesn’t do much: it mainly manages resources (CPU, memory, …), launches tasks on slaves, and forwards status messages between tasks and frameworks.
  • Slave(s): a slave monitors individual tasks, reports their status to the master, and ensures that tasks don’t exceed resource limits. It executes the tasks submitted by frameworks.
  • Framework(s): a framework is, for instance, your application; it receives resource offers from the master and launches tasks.

Examples of frameworks are Hadoop, MPI, Hypertable and Spark, as mentioned above.
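The offer cycle between master and frameworks can be caricatured in a few lines of Python. This is a toy sketch with invented names to show the division of roles, nothing to do with the real Mesos API:

```python
# Toy sketch of the Mesos resource-offer cycle (invented names, not the real API).

class Master:
    def __init__(self, slaves):
        self.slaves = slaves            # slave name -> free CPUs

    def make_offer(self):
        # Offer whatever resources are currently free on each slave.
        return {s: cpus for s, cpus in self.slaves.items() if cpus > 0}

    def launch(self, slave, cpus):
        # The master launches tasks on slaves within the offered limits.
        assert self.slaves[slave] >= cpus
        self.slaves[slave] -= cpus

class Framework:
    """Plays the scheduler role: receives offers and decides what to launch."""
    def __init__(self, needed_cpus):
        self.needed = needed_cpus
        self.tasks = []

    def on_offer(self, master, offer):
        for slave, cpus in offer.items():
            take = min(cpus, self.needed)
            if take:
                master.launch(slave, take)
                self.tasks.append((slave, take))
                self.needed -= take

master = Master({"slave1": 4, "slave2": 2})
fw = Framework(needed_cpus=5)
fw.on_offer(master, master.make_offer())
print(fw.tasks)           # the framework's tasks, spread across slaves
print(master.slaves)      # remaining free CPUs per slave
```

The point is the inversion of control: the master offers resources, and the framework, not the master, decides which tasks run where.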

Pyres – a Resque clone

Resque is a great implementation of a job queue by the people at GitHub; unfortunately 😛 it’s written in Ruby, so someone who works in Python ported the code, creating PyRes. You can put jobs (which can be any kind of class) on a queue and process them while watching the progress via your browser.
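The Resque idea, jobs as classes with a perform method pulled off a named queue, can be sketched in a few lines of plain Python. This is just the concept (queues as in-memory deques); it is not the actual PyRes API, which stores serialized jobs in Redis:

```python
import json
from collections import defaultdict, deque

queues = defaultdict(deque)   # queue name -> serialized jobs (Redis lists in Resque)

class AddJob:
    queue = "math"            # every job class declares its queue

    @staticmethod
    def perform(x, y):        # the work itself lives in perform()
        return x + y

def enqueue(job_cls, *args):
    # Resque-style: serialize the class name and arguments as JSON.
    payload = {"class": job_cls.__name__, "args": args}
    queues[job_cls.queue].append(json.dumps(payload))

def work(queue, registry):
    """A worker pops one job off the queue and runs its perform method."""
    payload = json.loads(queues[queue].popleft())
    return registry[payload["class"]].perform(*payload["args"])

enqueue(AddJob, 2, 3)
result = work("math", {"AddJob": AddJob})
print(result)  # 5
```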

Pandas cookbook

Pandas is a Python library for doing data analysis; it is fast and lets you do exploratory work really quickly. This is a cookbook that gives you some concrete examples for getting started with pandas.
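To give a taste of the kind of exploratory work the cookbook covers (assuming you have pandas installed; the cookbook itself works on real CSV datasets, this is just a tiny made-up frame):

```python
import pandas as pd

# A tiny frame of page visits; purely illustrative data.
df = pd.DataFrame({
    "page": ["home", "blog", "home", "about", "blog", "home"],
    "visits": [10, 4, 7, 2, 5, 3],
})

# Group, aggregate and sort: the bread and butter of exploratory analysis.
totals = df.groupby("page")["visits"].sum().sort_values(ascending=False)
print(totals)
```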

Some datasets

Here you can find a list of datasets available for download. I hope they can be useful 😛

Docker

Docker is an open source project to pack, ship and run any application as a lightweight container. Some people in my office pointed me to this project and it seems quite interesting. Let’s start trying to better understand what it really is.

This is a short description from the site:

Docker containers are both hardware-agnostic and platform-agnostic. This means that they can run anywhere, from your laptop to the largest EC2 compute instance and everything in between – and they don’t require that you use a particular language, framework or packaging system. That makes them great building blocks for deploying and scaling web apps, databases and backend services without depending on a particular stack or provider.

Typically you can distribute applications and sandbox their execution using virtual machines, for instance VMware, Oracle VirtualBox or Amazon EC2 AMIs. With this solution a developer should be able to package an application and distribute / deploy it with little effort. In practice this does not happen, mainly for these reasons:

  • Size: they may be very large and thus difficult to store and transfer
  • Performance
  • Portability: one VM instance does not play very well with competitor solutions
  • HW-centric

By contrast, Docker relies on a different sandboxing method known as containerization. Unlike traditional virtualization, containerization takes place at the kernel level.
Docker builds on top of these low-level primitives to offer developers a portable format and runtime environment that solves all 4 problems.
Docker containers are small (and their transfer can be optimized with layers), they have basically zero memory and CPU overhead, they are completely portable, and they are designed from the ground up with an application-centric design. In addition, because Docker operates at the OS level, it can still be run inside a VM!

JavaScript Patterns Collection

A JavaScript pattern and antipattern collection that covers function patterns, jQuery patterns, jQuery plugin patterns, design patterns, general patterns, literals and constructor patterns, object creation patterns, code reuse patterns, DOM and browser patterns.

Posted in Dataset, Python, Uncategorized

Wednesday, January 1st

Constraint programming

According to Wikipedia, constraint programming is:

” In computer science, constraint programming is a programming paradigm wherein relations between variables are stated in the form of constraints. Constraints differ from the common primitives of imperative programming languages in that they do not specify a step or sequence of steps to execute, but rather the properties of a solution to be found. This makes constraint programming a form of declarative programming. The constraints used in constraint programming are of various kinds: those used in constraint satisfaction problems (e.g. “A or B is true”), those solved by the simplex algorithm (e.g. “x ≤ 5”), and others. ”

Constraints differ from the common primitives of other programming languages in that they do not specify one or more steps to execute, but rather the properties of a solution to be found. This concept came up during an interesting video about how to define business rules and business rule logic using Groovy and DSLs, so I found this Groovy documentation page listing different libraries that help solve this kind of problem.
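The declarative flavour can be illustrated with a toy brute-force solver in plain Python: you state constraints as predicates on the solution, not the steps to compute it. Real constraint libraries use far smarter propagation and search; this is only a sketch of the idea:

```python
from itertools import product

def solve(domains, constraints):
    """Return every variable assignment (a dict) satisfying all constraints."""
    names = list(domains)
    solutions = []
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        if all(c(assignment) for c in constraints):
            solutions.append(assignment)
    return solutions

# "x <= 5, x + y == 10, x < y" stated as properties of the solution,
# not as a sequence of steps to execute.
domains = {"x": range(10), "y": range(10)}
constraints = [
    lambda a: a["x"] <= 5,
    lambda a: a["x"] + a["y"] == 10,
    lambda a: a["x"] < a["y"],
]
print(solve(domains, constraints))
```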

Whoosh – a python search library

This is the description extracted from here: ” Whoosh is a fast, pure Python search engine library. The primary design impetus of Whoosh is that it is pure Python. You should be able to use Whoosh anywhere you can use Python, no compiler or Java required. Like one of its ancestors, Lucene, Whoosh is not really a search engine, it’s a programmer library for creating a search engine”. If you have ever had a chance to work with Lucene or something like Solr or Elasticsearch you’ll find lots of common concepts and models, so it should be quite easy to use this library; let’s see if I’ll have any project that may require it.

Titan – a graph database

Titan is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users executing complex graph traversals. It is an awesome library / framework to work with graphs in a scalable manner. You should take into account that it builds on the TinkerPop stack and, more importantly, it can be configured with Cassandra, HBase or Oracle Berkeley DB as storage backend and Elasticsearch as search provider. This article reports how to set up a Titan solution on AWS. I’ll spend some of my spare time working with Titan, mainly on Cassandra. I’ll keep you … posted 😛 I was forgetting a pleasing detail about Titan: it has been designed to work in real time; for batch or analytics workloads, Faunus is provided. Faunus is a Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster. I hope to be able to test this feature too 🙂

Posted in Uncategorized

MongoDB – Part I

What is MongoDB ?

In short, extracted exactly from MongoDB site, MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling.

Document Database

A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects.

Advantages of using documents:

  • Documents (i.e. objects) correspond to native data types in many programming languages.
  • Embedded documents and arrays reduce the need for expensive joins.
  • Dynamic schema supports fluent polymorphism.

High Performance

MongoDB provides high performance data persistence.

  • Support for embedded data models reduces I/O activity on database system.
  • Indexes support faster queries and can include keys from embedded documents and arrays.

High Availability

MongoDB provides:

  • automatic failover.
  • data redundancy.

through replica sets. A replica set is a group of MongoDB servers that maintain the same data set, providing redundancy and increasing data availability.

Automatic Scaling

MongoDB provides horizontal scalability as part of its core functionality.

  • Automatic sharding distributes data across a cluster of machines.
  • Replica sets can provide eventually consistent reads for low-latency, high-throughput deployments.


Installation

I will not dig into this task; you can find all the required information here for your preferred OS.

First steps

Once installed you must use it !! So let’s start with some basic commands to check the installation and become familiar with MongoDB.
I’ll use the mongo shell tool to connect to the MongoDB server and execute commands.

We can connect to the MongoDB server by starting the mongo shell; by default it looks for a server listening on port 27017 on the localhost interface. You can connect to a different port and host by providing the --port and --host parameters.
Once connected you can execute:

  • db command to report the name of the current database
  • show dbs to display the list of databases
  • use <db name> to switch to the <db name> database

You must be aware that MongoDB does not create a database until some data is inserted.

One of the most useful functions is help; it can be used:

  1. alone, to get a quick informative page on how the help command can be used
  2. as a first command followed by connect, keys, etc…, to describe each command
  3. as a method, like .help(), appended to some JavaScript, cursor, db and db.collection objects to get additional information

Let’s create our first documents in MongoDB; first of all, check with the db command that we are using the mydb database and, if not, execute the command use mydb. Then create two documents a and b like

a = { "firstname" : "davide", "lastname" : "brambilla" }
b = { "x" : 3, "y" : -1 }

the next step is to insert these documents into the testData collection (we will see soon what a collection is) as follows

db.testData.insert( a )
db.testData.insert( b )

as mentioned before, on the first insert both the database and the collection are created. The last step is to confirm that the testData collection exists and that the documents have been added to it; we can execute

show collections: which shows all the available collections in the database
db.testData.find(): which returns the set of documents contained in the testData collection
Posted in MongoDB

Monday, December 2nd, 2013

Understanding the Costs of Versioning an API (or a Service)

At first sight it may seem a simple problem; you may think: if I provide a new cool API, why wouldn’t they want to use it instead of the old one? Unfortunately, when you provide a service in a production environment it is not so easy for your customers to switch even a simple API call (or they do not want to !!). This article evaluates different alternatives; if you have never provided a production service it could be very enlightening.


Scrapy

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
Here you can find a sample project using scrapy. Enjoy it !!

Hadoop on Raspberry Pi

I’m very interested in Hadoop and I have played with local and pseudo-distributed configurations, but like the author of this article, I’d like to set up a real distributed Hadoop cluster. You can create some virtual machines (the easiest way, though on a single laptop it may not be so easy), use cloud services (maybe too expensive?), or gather some second- or third-hand PCs and set up a cluster (I need space !!!). But wait, what about the Raspberry Pi? This article is the first of three; good luck, and let’s see if they succeed.

How to install Python 2.7 and 3.3 on CentOS 6

CentOS 6 ships with Python 2.6.6, and you may want to update to Python 2.7 or even 3.x; unfortunately, some critical apps depend on Python 2.6.6 (for instance yum) 😦
This post shows how to install python 2.7 or 3.x without touching the required python version.

Tracing the History of N.C.A.A. Conferences

This post is for one of my best friends who, like me, loves American basketball, and not only the NBA but also the NCAA. I can’t remember how many NBA Finals games and Final Fours, Big Ten, Big 12 and A.C.C. games we watched together; not to mention that it is also an intriguing visualization solution 😛


Posted in Uncategorized

First steps with Cassandra

First steps with Apache Cassandra (C*)

According to Wikipedia Apache Cassandra is

an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients.

As first steps let’s dig into data model

Cassandra data model

There are five key elements that we need to investigate to describe the Cassandra data model: columns, rows, column families, super columns and keyspaces.


Column

A column is the most basic unit of data structure in the Cassandra data model. A column is a triplet of a name, a value, and a clock, which can be thought of as a timestamp. We can imagine something like the JSON below:

{
 "name" : "Davide",
 "value" : "",
 "timestamp" : 1385844381000
}

Row

When you’re working with Cassandra, you may choose to use wide or skinny rows. A wide row is a row that has lots and lots (perhaps tens of thousands or even millions) of columns; a skinny row has a small number of columns, something closer to a relational model. Wide rows are typically used to store lists of things; for instance, they can be used to store the list of user actions on your catalogue items. Skinny rows are more similar to traditional RDBMS rows: they contain similar sets of column names, the main difference being that in Cassandra columns are optional.

Column Family

A column family is a container for an ordered collection of rows, each of which is itself an ordered collection of columns. You may think of a column family as an RDBMS table, but there are some differences. First, while column families are defined, you may have different columns on each single row, so there is not a strict schema as in an RDBMS. Second, a column family has two attributes: a name and a comparator, which is used to sort columns when they are returned in a query result. Each column family is stored on disk in its own separate file, so to optimize performance it’s important to keep columns that you are likely to query together in the same column family.

Super Column

A super column is a special kind of column: whereas a regular column stores a byte array as its value, a super column stores a map of subcolumns as its value. Its basic structure is its name and the set of columns it stores; the columns are held as a map whose keys are the column names and whose values are the columns. Be aware that the super column structure goes only one level deep: you cannot define a super column that stores another super column. Also be aware that, when modeling with super columns, Cassandra does not index subcolumns, so when you load a super column into memory all of its columns are loaded as well. In such cases you may consider defining a composite key instead, for instance something like “contentid:insertts”. Note that super columns are not supported in CQL 3.

Super Column Family

A super column family is what you use if you want to create a group of related columns, that is, add another dimension on top of columns. Note that super column families are not supported in CQL 3.


Keyspace

A keyspace is the outermost container for data in Cassandra; it can be thought of as a relational database. A keyspace has a name and a set of attributes that define keyspace-wide behavior. You can create as many keyspaces as your application needs. The basic attributes that you can set per keyspace are:

  • Replication factor: the number of nodes that will act as copies (replicas) of each row of data
  • Replica placement strategy: how the replicas will be placed in the ring
  • Column families: a keyspace is a container for a list of one or more column families. Each keyspace has at least one and often many column families.

Generally it’s not recommended to create more than a single keyspace per application; it could be useful to define different keyspaces when you need to specify different replication options.
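To fix the ideas, the column / row / column-family nesting described above can be mimicked with plain Python dicts. This is a sketch of the model only, nothing to do with Cassandra’s actual storage or API:

```python
import time

def column(name, value):
    # A column is a (name, value, timestamp) triplet.
    return {"name": name, "value": value, "timestamp": int(time.time() * 1000)}

# A column family maps row keys to rows; each row maps column names to columns.
# Rows need not share the same columns: there is no strict schema.
users = {
    "davide": {c["name"]: c for c in [column("firstname", "Davide"),
                                      column("city", "Milan")]},   # wider row
    "anna":   {c["name"]: c for c in [column("firstname", "Anna")]},  # skinny row
}

print(users["davide"]["city"]["value"])
print("city" in users["anna"])   # columns are optional per row
```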

Posted in Apache Cassandra

Sunday, December 1st, 2013

Gruff: A Grapher-Based Triple-Store Browser for AllegroGraph software

This morning I was watching this video about graph databases and I found Gruff: A Grapher-Based Triple-Store Browser for AllegroGraph software from Franz Inc.
Investigating a little further, I found that this software works only with AllegroGraph; however, it is an interesting solution to evaluate and explore.

Next Generation Databases

This video from Emil Eifrem is a very quick but quite interesting overview of NoSQL databases. Emil is CEO of the Neo Technology company; however, his speech is quite unbiased.

The 2013 Daily chart Advent calendar

December reminds me of when I was a child: during this period I looked forward to waking up and grabbing chocolate from my advent calendar. Now I’ve grown up and unfortunately I can’t have chocolate for a while, so I found an electronic daily advent calendar that, every day until Christmas, shows one of the 24 most popular maps, charts, data visualisations and interactive features published on the Economist site in the last 12 months. On Christmas Eve the most popular infographic of 2013 will be revealed; in addition, there is a Christmas gift behind door number 25.

Krona a Hierarchical data browser

Krona allows hierarchical data to be explored with zoomable pie charts. Krona charts can be created using an Excel template or KronaTools, which includes support for several bioinformatics tools and raw data formats. The charts can be viewed with a recent version of any major web browser (see Browser support). Take a look at this example.


InfiniSQL

This morning I came across InfiniSQL; it is described as

Extreme Scale Transaction Processing.
InfiniSQL is the database for always on, rapid growth applications that need to collect and analyze in real time–even for complex transactions.

InfiniSQL is a relational database management system (RDBMS) written entirely from the ground up. InfiniSQL’s goals are: horizontal scalability, continuous availability, high throughput, low latency, high performance for complex multi-host transactions, and ubiquity. According to the information provided, it has been tested to support over 500,000 complex transactions per second with over 100,000 simultaneous connections on a cluster of only 12 single-socket x86-64 servers. Let’s give it a try 😉

KaHIP – Karlsruhe High Quality Partitioning

I have always wondered how you could efficiently split a graph into partitions, for instance to be able to scale out graph databases (maybe someone has already done it 🙂 ). You should be able to find a division of a graph’s node set into k equally sized blocks such that the number of edges that run between the blocks is minimized; this is exactly what KaHIP does.
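The objective KaHIP optimizes, the edge cut of a balanced k-way partition, is easy to state in code; finding a good partition is the hard part, which KaHIP’s algorithms handle. A naive illustration on a made-up graph:

```python
def edge_cut(edges, block_of):
    """Number of edges whose endpoints fall in different blocks."""
    return sum(1 for u, v in edges if block_of[u] != block_of[v])

# A 6-node graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Two balanced 2-way partitions (node -> block), with very different cuts.
good = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}   # cuts only the bridge
bad  = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}   # same balance, worse cut

print(edge_cut(edges, good))  # 1
print(edge_cut(edges, bad))   # 5
```

Both partitions split the nodes 3/3, but only the first respects the graph’s structure; minimizing this count over all balanced partitions is the problem KaHIP solves.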

Machine Learning Video Library

On this page you can find lots of videos related to different machine learning algorithms. I’m personally interested in SVMs (Support Vector Machines), so let’s start !!

Over 1000 D3.js Examples and Demos

How to start with D3.js? Obviously read the documentation first (yes, sure :P), but I found that it is very useful and easy to start from the ridiculous amount of awesome demos and code available online. Here you can find more than a thousand examples and demos. You can also refer to the official d3.js examples page here.

Machine learning – Support Vector Machine

Watching this video, Sentiment analysis using support vector machines (even if it was recorded at a Ruby conference 😉 ), I’ve found some interesting links about machine learning and SVMs:


Haskell video tutorials

If you are interested in purely functional languages, you may want to take a look at this set of video tutorials about Haskell.

Posted in Day By Day