Towards becoming a Data Scientist

The term data scientist seems to mean many different things to different people. I was thinking it might suit me before looking into what it actually means: after all, I am interested in statistics, behaviour, machine learning and UX evaluation. Looking at job advertisements for data scientists and data analysts showed just how wide the range of job descriptions and requirements can be. It seemed to include everything, from jobs where the main task is data entry to jobs where the main task is programming.

I found the infographic on datacamp helpful and decided to take it as my main guide to which skills to acquire in which areas.

 

However, to find out what is actually involved, more detailed research at the sharp end was needed:

On 27/03/2016, a search for “data scientist” brought up 23 results on irishjobs.ie and 275 on jobs.ie.

These sites were chosen because both are very popular in Ireland’s job market: http://www.alexa.com ranks them 97 and 103 in Ireland respectively. I therefore decided to focus on irishjobs.ie, because its ranking was higher.

[Screenshots: Alexa rankings for the two sites]

Of the 23 hits, most were from recruitment agencies. These were not considered further, because it is impossible to tell whether several agencies are advertising for the same company, which would lead to duplication. This left 8 adverts, of which one more had to be discarded because it was the same advertisement published twice on different dates.
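The filtering described above could be sketched in Python roughly as follows. This is only an illustration under assumptions: I actually did the filtering by hand, and the field names (`company`, `title`, `is_agency`, `published`) and the sample rows are made up, not taken from irishjobs.ie.

```python
def filter_ads(ads):
    """Drop agency ads, then drop repeats of the same ad published on different dates."""
    seen = set()
    kept = []
    for ad in ads:
        if ad["is_agency"]:
            continue  # agency ads may hide duplicates of the same employer
        # The publication date is deliberately left out of the key,
        # so a re-published ad counts as the same ad.
        key = (ad["company"].strip().lower(), ad["title"].strip().lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append(ad)
    return kept


# Made-up sample rows, for illustration only
sample_ads = [
    {"company": "Agency X", "title": "Data Scientist", "is_agency": True, "published": "2016-03-20"},
    {"company": "Company A", "title": "Data Scientist", "is_agency": False, "published": "2016-03-21"},
    {"company": "Company A", "title": "Data Scientist", "is_agency": False, "published": "2016-03-25"},
    {"company": "Company B", "title": "Senior Data Scientist", "is_agency": False, "published": "2016-03-22"},
]

for ad in filter_ads(sample_ads):
    print(ad["company"], "|", ad["title"])
```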

Of the remaining 7 ads that were live on that day, 6 required a relevant degree in either math (including statistics), computer science or engineering. Half mentioned PhD level, although that was not an essential requirement. Apart from specific requirements for the job profile (e.g. experience in customer-facing roles), all of the remaining ads wanted relevant experience, ranging from a minimum of 3 years to a minimum of 6 years.

The knowledge required was:

Agile software (2x)

Other Programming (3x)

SAS (3x)

R (3x)

Hadoop (3x)

SQL (3x)

Python (2x)

Machine Learning (3x)

 

Furthermore, one company wanted web skills such as HTML, CSS, JavaScript and PHP; another was interested in SPSS, Scala, the L language, SQL, OLAP and MDX.

These seemed an interesting starting point for a comparison with the infographic from datacamp.com.

I was astonished to find only one company requiring basic web skills. However, with the rise of coding as a subject in schools in the UK, US, India etc., it might be that they are not explicitly required because by now it is just assumed that these skills are a given, a bit like not explicitly asking for basic literacy.
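For anyone who prefers to keep such a tally in code rather than on paper, here is a minimal Python sketch. The counts are copied from the list above; the variable and function names are my own, and the one-off skills mentioned by single companies are left out.

```python
from collections import Counter

# Mentions per skill across the 7 remaining ads, copied from the tally above
skill_mentions = Counter({
    "Agile software": 2,
    "Other programming": 3,
    "SAS": 3,
    "R": 3,
    "Hadoop": 3,
    "SQL": 3,
    "Python": 2,
    "Machine learning": 3,
})

TOTAL_ADS = 7


def share_of_ads(skill):
    """Return the fraction of the sampled ads that mention a skill."""
    return skill_mentions[skill] / TOTAL_ADS


# most_common() lists the skills by count, highest first
for skill, count in skill_mentions.most_common():
    print(f"{skill:<18} {count}x  ({share_of_ads(skill):.0%} of ads)")
```

With only 7 ads the percentages should not be taken too seriously, but the same structure would work for a larger sample.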

The heavy focus on mathematics and statistics, however, did make me hopeful. For one, it indicates an interesting field of work. Secondly, as long as mathematics still has the image of being a difficult thing that only a chosen few can acquire, the number of people who choose to delve into it further will stay limited, enabling mediocre people to find work, the prospect of which makes my mediocre heart beat faster. It also helps to know that open courses in mathematics can be fairly cheap.

It is also interesting that the chart seems to imply that a firm footing in traditional research will continue to be in demand. This is supported by the ads used to give an insight into practice, which often mentioned Masters or PhD degrees. According to the infographic, nearly 10% of working data scientists have a PhD (compared with about 3% of the general population in the US).

The chart suggests ‘hacking skills’ as a major factor. I assume this is deliberately described in loose terms. The ads occasionally asked for experience or knowledge of specific languages or programming skills, but it seems that in general, once people get to grips with ANY programming language thoroughly, the acquisition of new or related languages is just a question of a bit of additional work.

 

Interestingly, in this sample of ads (which is an unscientific sample of convenience) the demand for knowledge of database construction seems to be secondary, although there is a strong demand for SQL knowledge, and one ad asked for Cassandra.

Looking at these two sources, the sample of job ads from a specific date and the chart made by datacamp.com, gives a very interesting picture of what is involved in being a data scientist as well as how to get there. The focus on math is interesting. As the technology and tools seem to move so quickly, mathematics appears to be an area of knowledge that does not change, and is therefore worth investing time and effort in.

The other area that seems worth focusing on is long-standing programming languages such as Python and Java. They should give you a solid foundation.

SQL seems a useful tool to know, for work as a data scientist as well as for many other areas of work.

Interesting but not surprising is the complete absence of digital marketing or management knowledge from both sources, the infographic and the ads. The course in ‘Big Data’ seems to focus on these areas as much as on math and databases. The way these areas are assessed makes it easy to get distracted from the areas that are worth focusing on by those modules whose place in a big data course can only be guessed at.

So, to continue on the path to becoming a data scientist, Python will be the next area for me to look at, and I finally want to enrol in the BSc in Mathematics degree that I have been looking at for years.

 

Another CA – Try R, and so I did

I have been using SPSS. A lot. Like most people using it, there are parts with which I was confident and content, and other parts (most of SPSS) that were guesswork or following a cookbook in the hope that what I was doing was not outlandishly and obviously wrong. I had heard of R for a long time, but it never seemed necessary to have a closer look at it. That changed with the part-time course I’m attending, where the tasks were:

To complete the free course on codeschool.com on R: Try R, evidenced by a screenshot, like this one:

[Screenshot: Try R course completion]

Furthermore, the requirements were:

“Based on this course, please create an example use case based on some data that you have created.

Use the R Graphics to visualise the data.”

I have run out of ideas. I had big plans: collecting data on the influence of films on popular baby names, the influence of popular baby names on popular pet names, and the influence of names on academic achievement (I seem to remember that students whose names start with A get A’s more often than other students, and that in Germany people with the first name ‘Kevin’ do worse than statistically probable, while the reverse is true for ‘Sofias’). But as much interest as I had in these areas (and fully aware that this will not be a case of ‘if-you’re-interested-in-it-it’s-going-to-be-interesting-to-somebody-else’), I decided to have a go on some simple data.

If I find something interesting to count in the office on Monday, I will. If it’s not covered by the confidentiality agreement I signed, then I will have a go at R, and see if I uncover something amazing. It might involve the canteen options, or the number of sneezes, or the amount of swearing around me.

But until then, this is a bar chart of the high temperatures in January in Dublin:

[Bar chart: daily high temperatures, Dublin, January]

It can be seen that it was pretty warm overall in January, with spring-like temperatures towards the end and a little slump in the middle of the month. On the 30th a surprise slump occurred. When I looked at the low temperatures for January, I really was surprised to find that, according to accuweather.com, there was only one day when the temperature went below zero.

[Bar chart: daily low temperatures, Dublin, January]

I am not sure where the frost on the roofs I could see in the mornings came from, but it is also true that my geraniums have not died and a yucca on the balcony seems pretty perky. So maybe it is true.

I also made bar charts of the historic lows and highs, but they are pretty flat, as would seem obvious with averages taken over a long period of time.

[Bar charts: historic high and historic low temperatures]

It would be lovely to see, though, whether there really is a slump for the “ice saints”, including ‘Cold Sophie’ (probably one of the academically excelling Germans): a group of three or so days in May that are said to be much colder than all the surrounding days. Maybe that’s what I will do next. Or I could check whether, since it doesn’t rain in Ireland as much as it used to (really?), it rains more on the continent.

Google Fusion Tables

The task sounded easy at first:

To create a heatmap of the population in Ireland in Google Fusion Tables, describing the process in a blog post and saying a few words about the information gained.

Google apps are usually self-explanatory, self-guided, easy and fun to use. I was interested in an easy-to-use visualization tool to facilitate rudimentary data analysis. However, my hope of playing with a bit of data and visualizing surprising or random correlations was thwarted, mainly by the realization of how difficult the process is and how deadlines always approach when you have nearly got it.

My first difficulty was not knowing where to get the raw data from. I assumed, as maps are a particular strength of Google, that it would be easy to get an Irish county map from Google. And it might be, but I could not find it. I also had a long look at huge data sets on the CSO site; so huge, in fact, that I couldn’t see the wood for the trees, or in this case the tree for the woods: there was so much incredibly detailed data (age, gender, religion) that I wasn’t sure about the overall population of a county.

The counties themselves also turned out to be difficult. I never really understood the Ulster, Munster etc. thing and how they relate to Clare, Donegal etc. I never even understood whether Dublin is its own county or not. Now I know.

So I needed two files: a map of the counties, and a list of the counties with the associated population. I found the blog of a former participant (thank you, Brian, for sharing your advice) saying that he used a map in *.kml format from the irishindependent.ie website, and population data from a summary table on the census website. I uploaded these to Google Drive and merged them in the Google Fusion Tables app with the merge function, and all kinds of things happened, but never the ones I wanted. The county borders weren’t shown, or only the county borders were shown without any other information. Some of the heat spots were in the UK, one in the US, and one somewhere in the north of Germany (where, I now know, there is a place called Carlow).

[Image: Carlow is in Germany]

Even with my rudimentary knowledge of Irish geography I thought it unlikely that Ireland extends that far, and I needed to find out how to clean up my data (I cannot exclude the possibility, however, that there is a large Irish community in the north of Germany, as there seems to be a high density of Irish pubs). I managed to figure out how to move Louth and Longford from the UK to the island of Ireland. I realized that in my original file a left-over (undeleted) “or” was understood by the app to mean OR for Oregon, which produced the heat spot in the US.

[Image: Or is it in Oregon]

I also realized that there are two spellings for Laois, which shouldn’t have surprised me, as the Irish always like to put in a few extra letters if there is space, so I aligned all the spellings.

As still nothing happened, I did some more research and found that I had failed to merge the files on the matching columns (the county name) and had somehow matched a description with the county name instead. A mistake that shouldn’t have happened, as we learnt about foreign and primary keys in the database course.
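In hindsight, this kind of clean-up could be checked before uploading anything. The following is a minimal Python sketch, not what I actually did (I fixed the file by hand). It assumes the location column should only ever contain one of the 26 counties of the Republic, and the two spelling variants in the lookup are my assumption for illustration.

```python
# The 26 counties of the Republic of Ireland, used as the reference list
KNOWN_COUNTIES = {
    "Carlow", "Cavan", "Clare", "Cork", "Donegal", "Dublin", "Galway",
    "Kerry", "Kildare", "Kilkenny", "Laois", "Leitrim", "Limerick",
    "Longford", "Louth", "Mayo", "Meath", "Monaghan", "Offaly",
    "Roscommon", "Sligo", "Tipperary", "Waterford", "Westmeath",
    "Wexford", "Wicklow",
}

# Assumed spelling variants, for illustration only
SPELLING_VARIANTS = {"Laoighis": "Laois", "Leix": "Laois"}


def clean_county(value):
    """Return the canonical county name, or None if the value is not a county."""
    name = value.strip().title()
    name = SPELLING_VARIANTS.get(name, name)
    return name if name in KNOWN_COUNTIES else None


def split_clean_and_suspect(values):
    """Separate usable county names from values that need a manual look."""
    clean, suspect = [], []
    for value in values:
        county = clean_county(value)
        if county is None:
            suspect.append(value)
        else:
            clean.append(county)
    return clean, suspect


clean, suspect = split_clean_and_suspect(["Carlow", "or", " laoighis ", "Louth"])
print("clean:", clean)
print("suspect:", suspect)
```

A stray “or” ends up in the suspect list instead of being sent to the geocoder, where it could be read as Oregon.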
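The idea of merging on a shared key can be illustrated outside of Fusion Tables. Below is a small Python sketch under assumptions: the row layout, the column names and the population numbers are placeholders I made up (they are not census figures), and the real merge was done in the Fusion Tables interface.

```python
def county_key(name):
    """Build the join key: the county name, trimmed and case-insensitive."""
    return name.strip().casefold()


def merge_on_county(map_rows, population_rows):
    """Join map rows with population rows on the county name.

    Returns the merged rows plus the names of counties with no match,
    so that a failed merge is visible instead of silently empty.
    """
    population_by_county = {
        county_key(row["county"]): row["population"] for row in population_rows
    }
    merged, unmatched = [], []
    for row in map_rows:
        key = county_key(row["county"])
        if key in population_by_county:
            merged.append({**row, "population": population_by_county[key]})
        else:
            unmatched.append(row["county"])
    return merged, unmatched


# Placeholder rows: the geometry strings and numbers are invented
map_rows = [
    {"county": "Carlow", "geometry": "<kml polygon>"},
    {"county": "Laois", "geometry": "<kml polygon>"},
    {"county": "Louth", "geometry": "<kml polygon>"},
]
population_rows = [
    {"county": "carlow ", "population": 100},
    {"county": "Laois", "population": 200},
]

merged, unmatched = merge_on_county(map_rows, population_rows)
print(len(merged), "rows merged; unmatched:", unmatched)
```

Had I matched the description column against the county name, as I did at first, every row would have landed in the unmatched list.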

Then I just needed to go through the options, and after trying a series of different colours, I decided to go for a gradient.

Uploading and merging the relatively small data sets takes a fairly long time, but I guess it takes a lot less time than having to produce vectorized maps yourself.

The first map I produced revealed that Dublin is the most densely populated area, followed by Cork. I found this rather unrevealing, as most of the country seemed to be the same shade, with Dublin as the outlier.

[Image: first heatmap, everything pale]

I then made an inverse heat map where the least populated areas were the most ‘outstanding’. However, given the usual expectation of how a heatmap works, this turned out to be confusing.

[Image: heatmap with focus on the countryside]

So, in line with traditional heat maps, I went back to the reddish colours for the densely populated areas, but changed the gradient to a finer one for the less populated areas.
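The idea behind the finer gradient is simply to use narrow buckets at the low end and wide buckets at the high end. A minimal Python sketch of that idea follows; the boundaries and colour codes are hypothetical values chosen for illustration, not the settings I used in Fusion Tables.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries (population): narrow steps at the low end,
# one wide step at the high end so a single large county does not flatten the rest.
BOUNDARIES = [50_000, 100_000, 150_000, 200_000, 500_000]

# One colour per bucket, light to dark red (one more colour than boundaries)
COLOURS = ["#fff5f0", "#fdd0bc", "#fc9f81", "#fb6a4a", "#de2d26", "#a50f15"]


def colour_for(population):
    """Return the colour of the bucket that a population value falls into."""
    return COLOURS[bisect_right(BOUNDARIES, population)]


# Made-up example values
for value in (30_000, 120_000, 1_000_000):
    print(value, "->", colour_for(value))
```

With evenly spaced buckets, most counties would share the lightest shade, which is exactly what made my first map so pale.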

And this is my final, hopefully clickable Google Fusion Heat Map:

[Image: final heatmap]

The information yielded is that most of Ireland is sparsely populated, with the exception of Dublin and Cork.

The next step would be to find historical data and see how the population has changed over time.