I found RapidMiner hard going and cumbersome. The free trial version took ages to open and seemed to have clogged up my computer like a fatberg clogs a sewage system. So I wasn’t feeling positive towards the programme that promises to make everyone into a unicorn when I started to look at it. And I still don’t.
The software’s own documentation seems to have been written by experts, for experts. The tutorial leads you through step by step but explains very little, so it is near impossible to apply the information from the tutorial to your own data set. There is no easily available help if you are stuck, and error messages give little indication of how to resolve the issue. I found RapidMiner unintuitive, and came to detest it during the many, many, many hours I tried to get it to work for me.
Anyway, so that not all the effort was lost, and mainly because it is part of the assignment, I have put together this blog post about my frustrating journey.
After having done the tutorial on survival in the Titanic disaster, and sourced a data set that seemed to suggest similar questions (census data from the States), I decided to go for a decision tree, only to re-read the assignment brief and start to doubt whether a decision tree would be an acceptable output. By this point that was fine by me, because I didn’t really understand what I was doing, even though the data invited a decision tree (what are the factors leading to an income over a certain amount? I could easily have made a business case for a gold digger). The data I looked at was based on the 1994 Census in the States (http://www.census.gov/ftp/pub/DES/www/welcome.html); it was kindly provided by Ronny Kohavi and Barry Becker and is available at http://archive.ics.uci.edu/ml/datasets/Adult .
Right – as I said, this didn’t make any sense. I tried ANOVA to similar effect, with a similarly high time investment, a lot of frustration and no learning effect. By that time I wished I had just used SPSS, but I wasn’t quite sure if that would count for the assignment.
I looked at RapidMiner in the context of an assignment for a data analytics module. The task was open enough: to apply the CRISP-DM methodology to analyse an open data set of one’s own choosing. The analysis should be either regression analysis, analysis of variance (ANOVA) or time series analysis. The assignment was to be submitted in the form of a report and a blog post.
So I followed the tips provided in a video by Thomas Ott on S&P 500 data for time series forecasting. There seem to be very few tutorials on RapidMiner, and I am very grateful for the work and generosity that went into Thomas Ott’s tutorial.
The data I chose was provided through the site http://www.bankofengland.co.uk/research/Pages/onebank/threecenturies.aspx as I assumed using similar data would make my life easier. The data is described on the website as follows: “The spreadsheet is organised into two parts. The first contains a broad set of annual data covering the UK national accounts and other financial and macroeconomic data stretching back in some cases to the late 17th century. The second section covers the available monthly and quarterly data for the UK to facilitate higher frequency analysis on the macroeconomy and the financial system. The spreadsheet attempts to provide continuous historical time series for most variables up to the present day by making various assumptions about how to link the historical components together. But we also have provided the various chains of raw historical data and retained all our calculations in the spreadsheet so that the method of calculating the continuous times series is clear and users can construct their own composite estimates by using different linking procedures.”
The tutorial video emphasized that the advantage of the approach is that it combines machine learning techniques for forecasting with conventional forecasting algorithms.
The process consisted of three steps: setting up windowing, training the model, and evaluating the forecasts. As the assignment recommended following the CRISP-DM model, there were some additional preparation steps that were not in the tutorial.
Step 1: Business understanding, data understanding and data preparation
There was no concrete business case, as it was an assignment. So the questions asked of the data concerned general information over a long period of time: are there any regularities that could be exploited in the future?
This step sounded innocuous enough but turned out to be the real stumbling block. RapidMiner seems to use terms that are not the ones I recall from the statistics module of my undergraduate degree. Additionally, I found that most open data sets came from areas I had not been exposed to (cancer cell growth, economic data, traffic data, air pollution, what have you). So I followed the tutorial with the data set I chose and the questions I had in my head, and understood little.
The data was imported into RapidMiner, the data types were adapted, and examples with missing values were excluded. All this followed the suggestions in the tutorial.
Step 2.1: Modelling
The suggestion was to use ‘Windowing’. The process screenshot shows the set-up of the operators for windowing. The attribute that is a date can be used as the ID; this is what the ‘Set Role’ operator is for. ‘Select Attributes’ is used to choose which attributes should be used as inputs. And the actual Windowing operator comes last, to carry out the windowing.
In the windowing process the horizon determines how far out to make the prediction. In the example the window size is three and the horizon is one. This means that the fourth row of the time series becomes the first label.
The window size determines how many attributes are created for the cross-sectional data: each value in the window becomes an attribute of a cross-sectional example.
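To make the windowing idea concrete, here is a minimal sketch in plain Python of what the operator does conceptually (the function name and the toy series are my own, not RapidMiner’s): with window size three and horizon one, each example gets the last three values as attributes and the next value as its label.

```python
# Conceptual sketch of time-series windowing (window size 3, horizon 1).
# Each example: the three values in the window become attributes,
# and the value 'horizon' steps after the window becomes the label.

def window(series, window_size=3, horizon=1):
    examples = []
    for i in range(len(series) - window_size - horizon + 1):
        attributes = series[i:i + window_size]          # window values -> attributes
        label = series[i + window_size + horizon - 1]   # future value -> label
        examples.append((attributes, label))
    return examples

series = [10, 11, 12, 13, 14, 15]
for attrs, label in window(series):
    print(attrs, "->", label)
# [10, 11, 12] -> 13
# [11, 12, 13] -> 14
# [12, 13, 14] -> 15
```

As described above, the fourth row of the series (13) becomes the first label.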
Step 2.2: Train the model
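The tutorial’s training step is not detailed here, so as a hedged illustration only: training amounts to fitting a learner on the windowed examples. The sketch below assumes a deliberately simple one-parameter model (label ≈ w × last window value) fitted by least squares, as a stand-in for whatever learner RapidMiner is configured with.

```python
# Illustrative training on windowed examples. The 'model' here is a
# single least-squares coefficient w such that label ~ w * last value
# of the window; RapidMiner's actual learner is configurable and
# more sophisticated.

def train(windows):
    xs = [attrs[-1] for attrs, _ in windows]   # last value of each window
    ys = [label for _, label in windows]       # the labels
    # Closed-form least squares for a single coefficient (no intercept)
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

windows = [([10, 11, 12], 13), ([11, 12, 13], 14), ([12, 13, 14], 15)]
w = train(windows)
print(w)  # slightly above 1, since each label is one step ahead
```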
Step 3: Evaluation
We run the process from step 2 on the test set that we set aside earlier, and connect them so that the generated model can be applied using the ‘Apply Model’ operator. This matches everything where the headers (attribute names) in the two sets are identical.
From the resulting plot we can evaluate how good the forecast is.
By this point I had spent a lot of time and effort on RapidMiner, only to find my efforts were in vain. I was expecting a graph and statistics that would give me answers to my questions. A graph like this:
But this one I got only by changing the data randomly, without understanding, until I got SOME output.
The model did not get deployed, and I got no useful answers to any of the questions I had hoped to answer from the data. I had spent so much time on RapidMiner by then that there was no time left to try R or SPSS as alternatives I would have been more comfortable with.
Here are some screenshots from the preparation: