RapidMiner might be great, but not for me

I found RapidMiner hard going and cumbersome. The free trial version took ages to open and seemed to have clogged up my computer like a fatberg clogs a sewage system. So I wasn't feeling very positive towards the programme that promises to make everyone into a unicorn when I started to look at it. And I still don't.

The software's own documentation seems to have been written by experts for experts. The tutorial leads you through step by step but explains very little, so it is near impossible to apply the information from the tutorial to your own data set. There is no easily available help if you are stuck, and error messages give you little indication of how to resolve the issue. I found RapidMiner unintuitive, and came to detest it during the many, many, many hours I tried to get it to work for me.

Anyway, so that not all the effort was lost, and mainly because it is part of the assignment, I have put together this blog post about my frustrating journey.

After having done the tutorial on survival in the Titanic disaster, and having sourced a data set that seemed to suggest similar questions (census data from the States), I decided to go for a decision tree, only to re-read the assignment brief and start to doubt whether a decision tree would be an acceptable output. By that point this was fine by me, because I didn't really understand what I was doing, even though the data invited a decision tree (what are the factors leading to an income over a certain amount? I could easily have made a business case for a gold digger). The data I looked at was based on the 1994 Census in the States (http://www.census.gov/ftp/pub/DES/www/welcome.html); it was kindly provided by Ronny Kohavi and Barry Becker and is available at http://archive.ics.uci.edu/ml/datasets/Adult .
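For the record, here is roughly what that decision-tree idea could look like in R rather than in RapidMiner. This is only a sketch under some assumptions: it expects the adult.data file from the UCI link above (no header row, '?' for missing values), and the choice of predictors is mine, not anything from the tutorial.

```r
# Sketch: a decision tree on the UCI Adult data, asking which factors
# are associated with an income over 50K. Uses rpart, which ships with R.
library(rpart)

cols <- c("age", "workclass", "fnlwgt", "education", "education_num",
          "marital_status", "occupation", "relationship", "race", "sex",
          "capital_gain", "capital_loss", "hours_per_week",
          "native_country", "income")

adult <- read.csv("adult.data", header = FALSE, col.names = cols,
                  strip.white = TRUE, na.strings = "?")

tree <- rpart(income ~ age + education + occupation + sex + hours_per_week,
              data = adult, method = "class")

print(tree)                           # the splits as text
plot(tree); text(tree, use.n = TRUE)  # quick-and-dirty tree plot
```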

 

[Screenshots: RapidMiner decision tree attempts, 2016-06-18]

Right – as I said, this didn't make any sense. I tried ANOVA to similar effect, with a similarly high time investment, no learning effect and a lot of frustration. By that time I wished I had just used SPSS, but I wasn't quite sure whether that would have counted for the assignment.

I looked at RapidMiner in the context of an assignment for a data analytics module. The task was open enough: to apply the CRISP-DM methodology to analyse an open data set of one's own choosing. The analysis should be either regression analysis, analysis of variance (ANOVA) or time series analysis. The assignment submission should be in the form of a report and a blog post.

So I followed the tips provided in a video by Thomas Ott on S&P 500 data for time series forecasting. There seem to be very few tutorials on RapidMiner, and I am very grateful for the work and generosity that has gone into the tutorial by Thomas Ott.

The data I chose was provided through the site http://www.bankofengland.co.uk/research/Pages/onebank/threecenturies.aspx , as I assumed that using similar data would make my life easier. The data is described on the website as follows: “The spreadsheet is organised into two parts. The first contains a broad set of annual data covering the UK national accounts and other financial and macroeconomic data stretching back in some cases to the late 17th century. The second section covers the available monthly and quarterly data for the UK to facilitate higher frequency analysis on the macroeconomy and the financial system. The spreadsheet attempts to provide continuous historical time series for most variables up to the present day by making various assumptions about how to link the historical components together. But we also have provided the various chains of raw historical data and retained all our calculations in the spreadsheet so that the method of calculating the continuous times series is clear and users can construct their own composite estimates by using different linking procedures.”

The tutorial video emphasized that the approach combines the strengths of machine learning for forecasting with conventional forecasting algorithms.

The process consisted of three steps: setting up the windowing, training the model, and evaluating the forecasts. The assignment recommended following the CRISP-DM model:

[CRISP-DM diagram]

So there were some additional steps, needed to apply the CRISP-DM model, that were not in the tutorial.

Step 1 consisted of gaining business understanding, data understanding and data preparation.

Business Understanding

There was no concrete business case, as this was an assignment. So the questions asked of the data concerned general patterns over a long period of time, to gain insight into whether there are any regularities that could be exploited in the future.

Data Understanding

This step sounded innocuous enough but turned out to be the real stumbling block. RapidMiner seems to use terms that are not the same as those I recall from the statistics module of my undergrad degree. Additionally, I found that most open data sets were from areas that I had not been exposed to – e.g. cancer cell growth, economic data, traffic data, air pollution, what have you. So I followed the tutorial with the data set I chose and the questions I had in my head, and understood little.

Data Preparation

The data was imported into RapidMiner, the data types were adapted, and examples with missing values were excluded. All this followed the suggestions in the tutorial.
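For comparison, the same preparation could be sketched in R. The file name, sheet name and column names below are placeholders of my own, not the real ones from the Bank of England spreadsheet.

```r
# Rough R equivalent of the preparation done in RapidMiner:
# import the data, adapt the data types, exclude examples with missing values.
library(readxl)

raw <- read_excel("threecenturies.xlsx", sheet = "annual data")  # placeholder names

series <- data.frame(year  = as.integer(raw$Year),   # adapt data types
                     value = as.numeric(raw$GDP))
series <- na.omit(series)                            # exclude missing values
```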

Step 2.1: Modelling

The suggestion was to use 'Windowing'. The process screenshot shows the setting up of the operators for the windowing. The attribute that holds the date can be used as the ID; this is what the 'Set Role' operator is used for. 'Select Attributes' is used to choose which data should be used as input. And the actual 'Windowing' operator comes last, to perform the windowing.

[Screenshot: the windowing process]

In the windowing process the horizon determines how far out to make the prediction. In the example the window size is three and the horizon is one. This means that the fourth row of the time series becomes the first label.

The window size determines how many attributes are created in the cross-sectional data: each value in the window becomes an attribute.
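A small R sketch of what this windowing step does (my own reconstruction, not RapidMiner's code): with a window size of three and a horizon of one, each run of three consecutive values becomes a row of attributes, and the value one step ahead becomes the label.

```r
# Turn a univariate series into windowed, cross-sectional examples.
make_windows <- function(x, window = 3, horizon = 1) {
  n <- length(x) - window - horizon + 1
  attrs <- t(sapply(seq_len(n), function(i) x[i:(i + window - 1)]))
  colnames(attrs) <- paste0("lag", (window - 1):0)
  data.frame(attrs, label = x[seq_len(n) + window + horizon - 1])
}

make_windows(c(10, 12, 11, 13, 15, 14))
#   lag2 lag1 lag0 label
# 1   10   12   11    13   <- the fourth value becomes the first label
# 2   12   11   13    15
# 3   11   13   15    14
```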

Step 2.2: Train the model

The next step is to use the 'Sliding Window Validation' operator. This is a nested operator: a learner is trained inside it, and its output is handed to the 'Apply Model' and 'Performance' operators, also nested inside. Using it like this makes it possible to plug in all kinds of machine learning algorithms (e.g. regression or a neural network), and this is how machine learning can be used to improve the accuracy of the prediction.
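As a hand-rolled sketch of the idea, rather than RapidMiner's actual operator: sliding-window validation repeatedly trains on one block of windowed examples and tests on the block that follows, then averages the error. Here lm() stands in for whichever learner is nested inside, and the block sizes are arbitrary.

```r
# Rolling-origin validation over the windowed data frame 'win'
# produced by make_windows() above.
sliding_window_validation <- function(win, train_size = 40, test_size = 10) {
  starts <- seq(1, nrow(win) - train_size - test_size + 1, by = test_size)
  errs <- sapply(starts, function(s) {
    train <- win[s:(s + train_size - 1), ]
    test  <- win[(s + train_size):(s + train_size + test_size - 1), ]
    fit   <- lm(label ~ ., data = train)               # the nested learner
    sqrt(mean((predict(fit, test) - test$label)^2))    # RMSE on the test window
  })
  mean(errs)
}

sliding_window_validation(make_windows(sin(1:200 / 5)))  # toy series
```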

[Screenshot: sliding window validation]

Step 3: Evaluation

We run the process from step 2 on the test set that we set apart earlier, and connect them so that the generated model can be applied using the 'Apply Model' operator. This matches up everything where the headers in the two sets are identical.

From the resulting plot we can evaluate how good the forecast is.
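Continuing the R sketch (again an assumption of mine, not the actual RapidMiner process): hold back the last windows as a test set, apply the trained model to them, and plot forecast against actual.

```r
# Evaluate the forecast on held-out windows, using make_windows() from above
# and a toy sine-wave series in place of the real macroeconomic data.
win   <- make_windows(sin(1:120 / 5))
train <- win[1:90, ]
test  <- win[91:nrow(win), ]

fit  <- lm(label ~ ., data = train)
pred <- predict(fit, newdata = test)

plot(test$label, type = "l", ylab = "value",
     main = "Actual vs forecast on the hold-out windows")
lines(pred, lty = 2)
legend("topleft", legend = c("actual", "forecast"), lty = c(1, 2))
```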

By this point I had spent a lot of time and effort on RapidMiner, only to find my efforts were in vain. I was expecting a graph and statistics that would give me answers to my questions. A graph like this:

[Graph]

But this one I got only by changing the data randomly and without understanding, until I got SOME output.

Deployment

The model did not get deployed; I got no useful answers to any of the questions I had hoped to answer from the data. I had spent so much time on RapidMiner by then that there was no time left to try R or SPSS as alternatives I would have been more comfortable with.

 

 

Here are some screenshots from the preparation:

[Screenshot: validation]

 

[Screenshots]


Blockchain and Analogies

As somebody interested in science communication, emerging technologies are very exciting to me. It seems that I missed the emergence of the internet, biotech and nano – by the time I realized they were there, they were really there.


So when I heard about blockchains for the first time I was really interested. I had missed their first emergence in cryptocurrencies – however, it seems that only now are their possible applications outside the financial sector becoming apparent. Blockchains, it is claimed, are laying the groundwork for the next century's economic growth – outgrowing their uses in the finance sector to encompass the whole of society.

[Paul Baran's 1962 diagram of a distributed network]
So, blockchains are distributed databases, and they have funky attributes. They are autonomous, meaning they run on their own. They are extremely durable: because they are copied across thousands of computers, a blockchain could fully rebuild itself even if most of those computers were taken offline. They are said to be secure, because the code is open source and the ledger is 'cryptographically auditable', which means you can be mathematically certain that the entries have not been manipulated. This, together with their open use policy, is supposed to make them ubiquitous. Anyone can audit the code.
So what uses can they have outside the money business? Well, you could use them to have time-stamped proof of your copyright or idea, and you could use them to sign and adapt contracts.
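To make the 'cryptographically auditable' part a bit more concrete, here is a toy hash chain sketched in R – my own illustration, nothing to do with any real blockchain implementation, and it assumes the 'digest' package is installed.

```r
# Each block stores a hash of the previous block, so changing any earlier
# entry invalidates every hash that follows it.
library(digest)

add_block <- function(chain, entry) {
  prev_hash <- if (length(chain) == 0) "0" else chain[[length(chain)]]$hash
  block <- list(entry = entry, prev_hash = prev_hash)
  block$hash <- digest(paste(entry, prev_hash), algo = "sha256")
  c(chain, list(block))
}

chain <- add_block(list(), "Alice pays Bob 5")
chain <- add_block(chain, "Bob pays Carol 2")

# Auditing means recomputing each hash and checking it against the next
# block's prev_hash; tampering with an earlier entry shows up immediately.
```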

This reminds me that I read about a special kind of contract that Estonia seems to be considering: marriage on the blockchain:
http://bravenewcoin.com/news/bitnation-starts-offering-blockchain-public-notary-service-to-estonian-e-residents/

This made me realize just how difficult it is to understand this concept of databases. I believe it is partly because the first databases were compared to accounting ledgers. This analogy probably worked well for databases such as Excel spreadsheets. But does it really work for others? I always imagined databases as data that is physically stored somewhere, like old leather-bound books in a library, plus data that tells you where to find that data, like the card register in the library.
[Image: a card index]

But this analogy doesn't work at all for blockchains. It would be register cards left on pieces of books, telling you where to find the rest of the book or the next volume. And these books, or pieces of books, would not be in one library; the same pieces of books would be distributed through many libraries, constantly copied and constantly added onto new pieces of books, with register cards referring to the next piece of book. That would not work at all as a system for organising a library.


I was looking for different metaphors that suited this model better, and was quite happy to settle on the brain and how it forms memories (which is in a way ironic when you consider that one of the problems in understanding the brain was the inappropriate model of linear computers).

So now, we'll see what's emerging.

[Image: Bitcoin engagement ring]

Towards becoming a Data Scientist

The term data scientist seems to mean many different things to different people. I was thinking it might suit me before looking into what it actually means: after all, I am interested in statistics, behaviour, machine learning and UX evaluation. Looking at job advertisements for data scientists and data analysts showed just how wide the range of job descriptions, as well as the range of requirements, can be. It seemed to include everything, from jobs where the main task is data entry to jobs where programming is the main task.

I found the infographic on DataCamp helpful and decided to take it as my main guideline for which skills to acquire in which areas.

 

However, more detailed research was needed.

To find out what is involved, research on the sharp end was needed:

On the 27/03/2016 a search for “data scientist” brought up 23 results on irishjobs.ie and 275 on jobs.ie.

These sites were chosen because of their high rankings; both are very popular in Ireland's job market. The sites are ranked 97 and 103 respectively in Ireland by http://www.alexa.com. It was therefore decided to focus on irishjobs.ie because its ranking was higher.

[Screenshots: Alexa rankings for the two sites]

Of the 23 hits, most were by recruitment agencies. These were not considered further, because it cannot be seen whether they are advertising for the same company, which would lead to duplication. This left 8 adverts, of which one further needed to be discarded, as it was the same advertisement published twice on different dates.

Of the remaining 7 ads that were live on that day, 6 required a relevant degree in either maths (including statistics), computer science or engineering; half mentioned PhD level, although that was not an essential requirement. Apart from specific requirements for the job profile (e.g. experience in customer-facing roles), all of the remaining ads wanted relevant experience, from a minimum of 3 years to a minimum of 6 years.

The knowledge that was required was:

Agile software (2x)

Other Programming (3x)

SAS (3x)

R (3 x)

Hadoop (3x)

SQL (3x)

Python (2x)

Machine Learning (3x)

 

Furthermore, one ad wanted web skills such as HTML, CSS, JavaScript and PHP; another company was interested in SPSS, Scala, the L language, SQL, OLAP and MDX.
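Just to visualize the tally above, in the same base-R style used later in the Try R post (the counts are copied from the list; the chart itself is my own addition):

```r
# Skills mentioned across the 7 live data scientist ads on irishjobs.ie
skills <- c("Agile" = 2, "Other programming" = 3, "SAS" = 3, "R" = 3,
            "Hadoop" = 3, "SQL" = 3, "Python" = 2, "Machine Learning" = 3)

barplot(sort(skills, decreasing = TRUE), las = 2,
        ylab = "Number of ads",
        main = "Skills required in data scientist ads")
```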

This seemed an interesting starting point for a comparison with the infographic from datacamp.com.

I was astonished to find only one company requiring basic web skills. However, with the rise of coding as a subject in schools in the UK, US, India etc., it might be that they are not explicitly required because by now these skills are just assumed to be given – a bit like not explicitly asking for basic literacy.

The heavy focus on mathematics and statistics, however, did make me hopeful. For one, it indicates an interesting field of work. Secondly, as long as mathematics still has the image of being a difficult thing that only a chosen few can acquire, the number of people who choose to delve into it further will stay limited, enabling mediocre people to find work – the prospect of which makes my mediocre heart beat faster. This in addition to knowing that open courses in mathematics can be fairly cheap.

It is also interesting that the chart seems to imply that a firm footing in traditional research will remain in demand. This is also supported by the ads that were used to give the insight into practice, which often mentioned Masters or PhD degrees; according to the infographic nearly 10% of working data scientists have a PhD (compared with about 3% of the general population in the US).

The chart suggests 'hacking skills' as a major factor. I assume this is deliberately described in loose terms. The ads occasionally asked for experience or knowledge of specific languages or programming skills, but it seems that in general, once people get to grips with ANY programming language thoroughly, the acquisition of new or related languages is just a question of a bit of additional work.

 

Interestingly, from this sample of ads (which is an unscientific sample of convenience), the demand for knowledge of database construction seems to be secondary, although there is a strong demand for SQL knowledge, and one ad asked for Cassandra.

Looking at these two sources, the sample of job ads on a specific date and the chart made by datacamp.com, gives a very interesting picture of what is involved in being a data scientist, as well as in getting there. The focus on maths is interesting. As the technology and tools seem to move so quickly, mathematics appears to be a tool, or area of knowledge, that is unchanging and therefore worth investing time and effort in.

The other area that seems worth focusing on is longstanding programming languages such as Python and Java. They should give you a solid foundation.

SQL seems a useful tool to know, for work as a data scientist as well as for many other areas of work.

Interesting, but not surprising, is the complete absence of digital marketing or management knowledge from both sources, the infographic and the ads. The 'Big Data' course I am on seems to give these areas equal weight to the maths and databases. The way these areas are assessed makes it easy to get distracted from the areas that are worth focusing on by modules whose place in a big data course can only be guessed at.

So, to continue on the path to becoming a data scientist, Python will be the next area for me to look at, plus I finally want to enrol in the BSc in Mathematics degree that I have been looking at for years.

 

Another CA – Try R, and so I did

I have been using SPSS. A lot. Like most people using it, there are parts with which I was confident and content, and other parts (most of SPSS) that were guesswork, or following a cookbook in the hope that what I was doing was not outlandishly and obviously wrong. I had heard of R for a long time, but it never seemed necessary to have a closer look at it. That changed with the part-time course I'm attending, where the task was:

To complete the free course on codeschool.com on R: Try R, evidenced by a screenshot, like this one:

[Screenshot: Try R course completion]

Furthermore, the requirements were:

“Based on this course, please create an example use case based on some data that you have created.

Use the R Graphics to visualise the data.”

I had run out of ideas. I had big plans: collecting data on the influence of films on popular baby names, the influence of popular baby names on popular pet names, or the influence of names on academic achievement (I seem to remember that students whose names start with A get A's more often than other students, and that people with the first name 'Kevin' do worse than statistically probable in Germany, while the reverse is true for 'Sofias'). But as much interest as I had in these areas (and fully aware that this would not be a case of 'if-you're-interested-in-it-it's-going-to-be-interesting-to-somebody-else'), I decided to have a go with some simple data.

If I find something interesting to count in the office on Monday, I will. If it’s not covered by the confidentiality agreement I signed, then I will have a go at R, and see if I uncover something amazing. It might involve the canteen options, or the number of sneezes, or the amount of swearing around me.

But until then, this is a bar chart of the high temperatures in January in Dublin

[Bar chart: daily high temperatures in Dublin, January]
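For anyone wondering how such a chart is made, this is roughly the base-R call behind it. The temperature values below are made-up placeholders for illustration, not the actual accuweather.com figures.

```r
# Daily high temperatures for January (placeholder values, degrees Celsius)
high_temps <- c(11, 10, 9, 12, 13, 11, 10, 9, 8, 10,
                9, 8, 7, 8, 9, 10, 11, 12, 13, 12,
                11, 12, 13, 14, 13, 12, 11, 12, 13, 6, 12)

barplot(high_temps, names.arg = 1:31, las = 2,
        xlab = "Day of January", ylab = "High temperature (°C)",
        main = "Daily highs in Dublin, January")
```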

It can be seen that it was pretty warm overall in January, with spring-like temperatures towards the end and a little slump in the middle of the month. On the 30th a surprise slump occurred. When I looked at the low temperatures for January I really was surprised to find that, according to accuweather.com, there was only one day when the temperature went below zero.

[Bar chart: daily low temperatures in Dublin, January]

I am not sure where the frost on the roofs I could see in the mornings came from, but it is also true that my geraniums have not died and a yucca on the balcony seems pretty perky. So maybe it is true.

I also made bar charts of the historic lows and highs, but they are pretty flat – as would seem obvious for averages taken over a long period of time.

[Bar charts: historic high and low temperatures]

It would be lovely to see, though, whether there really is a slump for the 'ice saints', namely 'Cold Sophie' (probably one of those academically excelling Germans) – a group of three or so days in May that are said to be much colder than all the surrounding days – and maybe that's what I will do next. Or to see whether, since it doesn't rain in Ireland as much as it used to (really?), it rains more on the continent.

Google Fusion Tables

The task sounded easy at first:

To create a heatmap of the population in Ireland in Google Fusion Tables, describing the process in a blog post and saying a few words about the information gained.

Google apps are usually self-explanatory, self-guided, easy and fun to use. I was interested in an easy-to-use visualization tool to facilitate rudimentary data analysis. However, my hope of playing with a bit of data and visualizing surprising or random correlations was thwarted – mainly by the realization of how difficult the process is, and how deadlines always approach just when you have nearly got it.

My first difficulty was not understanding where to get the raw data from. I assumed, as maps are a particular strength of Google, that it would be easy to get an Irish county map from Google. And it might be, but I could not find one. I also had a long look at huge data sets on the CSO site – so huge, in fact, that I couldn't see the wood for the trees, or in this case the tree for the woods, as there was so much incredibly detailed data (age, gender, religion) that I wasn't sure about the overall population of a county.

The counties themselves also turned out to be difficult – I never really understood the Ulster, Munster etc. thing and how they relate to Clare, Donegal etc. I never even understood whether Dublin is its own county or not. Now I know.

So I needed two files: a map of the counties, and a list of the counties with the associated population. I found the blog of a former participant (thank you, Brian, for sharing your advice) saying that he used a map in *.kml format from the irishindependent.ie website and his population data from a summary table on the census website. I uploaded these to Google Drive and merged them in the Google Fusion Tables app with the merge function, and all kinds of things happened, but never the ones I wanted. The county borders weren't shown, or only the county borders were shown without any other information. Some of the heat-spots were in the UK, one was in the US, and one was somewhere in the north of Germany (where, I now know, there is a place called Carlow).

[Map: Carlow shown in Germany]

Even with my rudimentary knowledge of Irish geography I thought it unlikely that Ireland extends that far, and I needed to find out how to clean up my data (I cannot exclude the possibility, however, that there is a large Irish community in the north of Germany, as there seems to be a high density of Irish pubs). I managed to figure out how to move Louth and Longford from the UK to the island of Ireland. I realized that in my original file a left-over (undeleted) 'or' was understood by the app to mean OR, as in Oregon, which produced the heat-spot in the US.

[Map: 'Or' placed in Oregon]

I also realized that there are two spellings of Laois, which shouldn't have surprised me, as the Irish always like to put in a few extra letters if there is space, so I aligned all the spellings.

As still nothing happened, I did some more research, only to find that I had failed by merging the tables not on the matching columns (the name of the county) but somehow by matching a description against the county name. A mistake that shouldn't have happened, as we learnt about primary and foreign keys in the database course.
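The underlying idea is just a table join on a key column. A minimal R sketch of the same merge, with a made-up fragment of data standing in for the real census figures and KML geometry:

```r
# Both tables must be joined on the matching column (the county name
# acting as the key); merging on anything else scatters the rows.
population <- data.frame(county = c("Carlow", "Dublin", "Laois"),
                         pop    = c(50000, 1300000, 80000))   # placeholder numbers
boundaries <- data.frame(county   = c("Carlow", "Dublin", "Laois"),
                         geometry = c("<kml...>", "<kml...>", "<kml...>"))

merged <- merge(boundaries, population, by = "county")
merged
```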

Then I just needed to go through the options, and after trying a series of different colours, I decided to go for a gradient.

Uploading and merging the relatively small data sets takes a fairly long time, but I guess it still takes a lot less time than having to produce vectorised maps yourself.

The first map I produced revealed that Dublin is the area with the densest population, followed by Cork. I found this rather non-revealing, as most of the country seemed to be the same shade and Dublin seemed to be the outlier.

[Map: everything pale]

I made an inverse heat map in which the least populated areas were the most 'outstanding'. However, given the expectation of how a heat map works, this turned out to be confusing.

[Map: focus on the countryside]

So, in line with traditional heat maps, I went back to reddish colours for the densely populated areas, but changed to a finer gradient for the less populated areas.

And this is my final, hopefully clickable Google Fusion Heat Map:

[Final heat map]

The information yielded is that most of Ireland is sparsely populated, with the exception of Dublin and Cork.

The next step would be to find historical data and look at changes over time.