Wednesday, January 6, 2010

Now I know how to mine data

There are just so many techniques for doing data mining. In grad school we only got to know one technique - using an artificial neural network. There's a tool for this - Weka. Download and install it, or sudo apt-get it (yes, imagine my surprise when I found out about this).

What's data mining?

Well, basically it's a process for getting knowledge (e.g. if customers buy diapers, they will most likely buy beer too) from your data (imagine spreadsheets filled with numbers and strings). You may have thousands of records, but how do you make sense of them? You mine the data and find a pattern. From the pattern you get knowledge. So from this you can put beer next to diapers in supermarkets to increase sales. That's the general idea.
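To make the diapers-and-beer idea a bit more concrete, here is a minimal Python sketch of how such a pattern could be measured. The baskets are made up for illustration, and the `support` and `confidence` helpers are just names I picked for this example, not part of Weka or any other tool.

```python
# Toy market baskets (made-up data, purely for illustration).
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets with the antecedent, the fraction that also hold the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


# How often do diapers and beer appear together, and how often
# does a diaper buyer also buy beer?
rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule with high support and high confidence is the kind of pattern that would justify moving the beer shelf.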

There's a 90% chance that the data you get is not clean, i.e. there are missing values, some data is inconsistent (e.g. sex can only be F or M, yet there's a value Q?), and some data is just plain rubbish. The data needs to go through a preprocessing stage. For missing values, you have to fill them in, using either the mean or the median, whichever suits your data best. The same goes for inconsistent data. This is where working with experts in the domain you're working in is very important. You don't want to remove what you think is rubbish when it actually means something.
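A rough sketch of that cleaning step in Python, assuming made-up records with an `age` and a `sex` field. Using the median for the missing age is only one possible choice, and the invalid sex code is flagged rather than guessed, so a domain expert can decide what it means.

```python
from statistics import median

# Made-up customer records; None marks a missing value.
records = [
    {"age": 25, "sex": "F"},
    {"age": None, "sex": "M"},
    {"age": 40, "sex": "Q"},  # "Q" is not a valid code
    {"age": 31, "sex": "M"},
]

VALID_SEX = {"F", "M"}


def clean(records):
    """Return cleaned copies: fill missing ages with the median, flag bad sex codes."""
    known_ages = [r["age"] for r in records if r["age"] is not None]
    fill_age = median(known_ages)
    cleaned = []
    for r in records:
        row = dict(r)  # copy, so the raw data stays untouched
        if row["age"] is None:
            row["age"] = fill_age
        if row["sex"] not in VALID_SEX:
            row["sex"] = None  # left for a domain expert to resolve
        cleaned.append(row)
    return cleaned


cleaned = clean(records)
print(cleaned)
```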

Then we have the data discretization process, where you reduce that huge amount of data (for example, by grouping continuous values into a handful of ranges) while it still carries the same information. Afterwards we normalize the data. Once this is done, the data is ready for the modelling exercise.
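Here is a small Python sketch of both steps, under my own assumptions: discretization is done by binning ages into labelled ranges, and normalization is done with min-max scaling to the range 0 to 1. The bin edges and labels are invented for the example, and other methods exist for both steps.

```python
def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal, nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


def discretize(value, edges, labels):
    """Return the label of the first bin whose upper edge is above value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # above every edge, so it falls in the last bin


ages = [18, 25, 40, 62]
normalized = min_max_normalize(ages)
age_groups = [discretize(a, [30, 50], ["young", "middle", "senior"]) for a in ages]
print(normalized)
print(age_groups)
```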


One of the benefits you get from data mining is classification. This is where, given some data (e.g. male, married, doesn't have children, has regularly purchased beer for the past year), you can predict whether a beer purchase is likely. So it will classify to something like purchase_beer equals 0 (no) or 1 (yes). In order for the model to predict 0 or 1, the model itself has to be trained with a lot of data. Train it until it reaches the accuracy we want; above 90% is good. A trained model with very high accuracy is going to be an asset, as you can feed it data and it will spit out what we want.
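As a toy illustration of "train until it reaches the accuracy we want", here is a single artificial neuron (a perceptron) in plain Python. This is not how Weka does it; the rows are invented, the feature order (male, married, has_children, regular_beer_buyer) is my own assumption, and the accuracy is measured on the training rows themselves, which is more optimistic than testing on data the model has never seen.

```python
def predict(weights, bias, features):
    """Single artificial neuron with a step activation: returns 1 or 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def accuracy(weights, bias, rows):
    """Fraction of rows whose predicted label matches the real one."""
    correct = sum(1 for features, label in rows if predict(weights, bias, features) == label)
    return correct / len(rows)


def train(rows, target_accuracy=0.9, max_epochs=100, rate=0.1):
    """Perceptron learning rule, stopping once the target accuracy is reached."""
    weights = [0.0] * len(rows[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        if accuracy(weights, bias, rows) >= target_accuracy:
            break
        for features, label in rows:
            error = label - predict(weights, bias, features)
            bias += rate * error
            weights = [w + rate * error * x for w, x in zip(weights, features)]
    return weights, bias


# (male, married, has_children, regular_beer_buyer) -> purchase_beer
rows = [
    ((1, 1, 0, 1), 1),
    ((1, 0, 0, 1), 1),
    ((0, 1, 1, 0), 0),
    ((1, 1, 1, 0), 0),
    ((0, 0, 0, 1), 1),
    ((0, 0, 1, 0), 0),
    ((1, 0, 1, 1), 1),
    ((0, 1, 0, 0), 0),
]

weights, bias = train(rows)
print("training accuracy:", accuracy(weights, bias, rows))
print("purchase_beer:", predict(weights, bias, (1, 1, 0, 1)))
```

Real data is rarely this tidy, so a real model would need many more rows and a separate set of records held back for testing.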

The most difficult part of doing this, for me, is the data preprocessing. You have to have quality, cleaned data to produce quality results. You have to carefully select which kinds of data are relevant to your goal (e.g. would you want to include one's job as one of the attributes considered for the diapers-beer example?), which is why having domain experts is important. They also have to determine how much each attribute should affect the result (e.g. job probably contributes 10% but marital status contributes 80% towards a diapers-beer purchase).

I just love AI

There are just so many preprocessing methods & hybrid data mining techniques that have already been researched by academicians. I'm just so overwhelmed by the number of technical papers on this. They're probably not as IT savvy as us, the implementors as they call us, but they definitely have the brains for that part of the world. We just have to scour through this massive database of papers and get it to run in our apps. Well, should you need it, that is. Coz processing thousands of records can take hours or days, sometimes months, depending on your machine spec.

Open data

While doing a data mining assignment over the last 2 weeks, I became frustrated with the unavailability of data in Malaysia. Sad. Maybe it's still difficult for us to see it now, but I'm already imagining all the stuff we could do with the data held by the likes of JPA, MOHR, MOH, MOHE - fuhhhh I'm all shuddery now. Even in OSCC, those training feedback forms and MyGOSSCON feedback forms - hmm, whatever happened to those?

There are concerns about exposing private data, I guess. Well, if you ask me, make the data anonymous, as I couldn't care less who got promoted last year. I only want the 'spec' of the person who got promoted - age, sex, location, department, salary, is he respected by peers, does he drive, etc. That kind of stuff, you get the idea.
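Something like this rough Python sketch is what I have in mind: drop the fields that say who the person is and keep the 'spec'. The field names and the record are hypothetical, and stripping direct identifiers like this is only a first step, since a rare enough combination of the remaining attributes can still point to one person.

```python
# Hypothetical field names that directly identify a person.
IDENTIFYING_FIELDS = {"name", "staff_id"}


def anonymize(record):
    """Return a copy of the record without the directly identifying fields."""
    return {key: value for key, value in record.items() if key not in IDENTIFYING_FIELDS}


# Made-up promotion record, purely for illustration.
promoted = [
    {"name": "Ali", "staff_id": "A123", "age": 34, "sex": "M", "department": "IT"},
]

anonymous = [anonymize(r) for r in promoted]
print(anonymous)
```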

Hmm.. this will take another 10 years to realise, I think.


  1. Abdullah Zainul Abidin, 7/1/10 1:13 AM

    Wow.. that is interesting stuff...Makes me feel like I want to continue my studies too.. >.<
