A Deep Dive into African Data Science on Kaggle

It began with a random LinkedIn DM to Mabu asking for data science advice. That led to a long advice call and a surprise offer of mentorship, followed by periodic calls about progress (and frustrations) in applying knowledge gained from courses to projects. Now it has led to what should be the first of many awesome projects by Kusasa (watch this space).

Kusasa is a data science entrant with a background in Geographic Information Systems (GIS) administration and analysis. He has worked in i) the fleet monitoring and management space, focusing on administering the GIS system of applications and on geospatial analyses, and ii) the wildlife monitoring and management space, focusing on providing technical support to partners using their suite of applications. He has developed a passion for machine learning (ML) and data science, which led to the random LinkedIn DM. Mabu and Kusasa designed a practical programme to make a data science transition via DataCamp. Using the handy Career Tracks, which strategically organise modules towards a specific career, they identified Data Scientist with Python (comprising 23 courses) as one of the first tracks to tackle.

This is all about Kusasa’s first data science project: his Exploratory Data Analysis (EDA) of the current state of data scientists as sampled by the Kaggle Survey. In this project, his intention was to take an African perspective in asking questions of this Kaggle Survey sample, and to use Python-based tools to answer these questions as he hones his data wrangling and EDA skills for the data science workflow. He found some of the resultant answers to be expected, some informative, others inspiring, and a handful shocking.

The dataset chosen for this analysis is the 2020 Kaggle Machine Learning & Data Science Survey listed in the resources section below. It’s messy survey data that requires some solid Python and pandas knowledge to clean up and make useful. It’s also an important dataset for insights into not only how African data scientists compare to those in the trendsetter countries but also the intra-Africa dynamics.
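To give a feel for the clean-up involved, here is a minimal sketch in plain Python (pandas would do the same job with less code). It assumes the survey export carries a second header row that repeats the question text, and that country and job title sit in columns named `Q3` and `Q5`. Those column names, the sample rows and the helper `load_data_scientists` are illustrative assumptions, not taken from Kusasa’s project.

```python
import csv
import io

# Tiny stand-in for the survey file; the real export is far wider.
# Column names (Q3 = country, Q5 = job title) are assumptions for this sketch.
SAMPLE_CSV = """Q3,Q5
In which country do you currently reside?,Select the title most similar to your current role
Nigeria,Data Scientist
Nigeria,Student
India,Data Scientist
Kenya, Data Scientist
United States of America,Data Scientist
"""


def load_data_scientists(text, countries):
    """Return the rows for data scientists living in the given countries."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)[1:]  # drop the row that repeats the question text
    wanted = {c.lower() for c in countries}
    result = []
    for row in rows:
        country = row["Q3"].strip()  # messy data: stray spaces around values
        role = row["Q5"].strip()
        if role == "Data Scientist" and country.lower() in wanted:
            result.append({"country": country, "role": role})
    return result


africa = load_data_scientists(SAMPLE_CSV, ["Nigeria", "Kenya"])
print(len(africa), [r["country"] for r in africa])
```

The same filter, pointed at the two trendsetter countries, gives the comparison group used throughout the rest of the analysis.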

The insights from this exploratory data analysis are intended for professionals based in Africa who intend to make a career change into the data science field.

Nigeria clearly has the most Kaggle data scientists in Africa. It is shocking to see that, out of 54 countries in Africa, only 6 appear to have data science activity on Kaggle. This might also indicate a huge untapped future job market in Africa.

Despite the trendsetter group being composed of only 2 countries (India and the USA), it represents over a third of Kaggle data scientists in the entire world. These two countries also have around 6 times the number of Kaggle data scientists of the 6 African countries combined. This again suggests there is still big job market potential for data scientists in Africa.

Both locally and abroad, female data scientists are dramatically under-represented. Understanding the drivers of this under-representation would be beneficial not only for the data science field but also for Africa’s fourth industrialisation, as Africa has a high number of female-headed households.

In line with the global trends, data science in Africa is still dominated by males across almost all age groups and countries. Also, across the countries, most Kaggle data scientists in Africa are young adults (between 25 and 35 years old).

A huge majority of data scientists in Africa have a tertiary qualification. The biggest demographic is those with a Bachelor’s degree, followed by those with a Master’s degree. Even though having a tertiary qualification seems to correlate positively with getting a data scientist job, there are still a number of data scientists who secured a job without one. Not having a tertiary qualification is therefore not an insurmountable barrier to securing a data scientist job in Africa.

Coding experience in Africa vs the Trendsetter countries
Proportional coding experience by Kaggle data scientists in Africa vs the Trendsetter countries

Compared to the trendsetter countries, Kaggle data scientists in Africa tend to have far fewer years of coding experience. A handful of Africa’s data scientists don’t even have coding experience. The coding experience required to be hired as a data scientist seems lower than most of us may expect.
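Because the African sample is much smaller than the trendsetter sample, a comparison like this only works on proportions rather than raw counts. A minimal sketch of that normalisation follows; the answer labels and the two lists are made up purely to show the mechanics and are not survey results, and `proportions` is a hypothetical helper.

```python
from collections import Counter


def proportions(answers):
    """Map each answer to its share of the group's responses."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {answer: count / total for answer, count in counts.items()}


# Made-up answers, only to illustrate the calculation (not survey results).
africa = ["< 1 years", "1-2 years", "1-2 years", "3-5 years"]
trendsetters = [
    "1-2 years", "3-5 years", "3-5 years", "5-10 years",
    "5-10 years", "5-10 years", "10-20 years", "3-5 years",
]

africa_share = proportions(africa)
trend_share = proportions(trendsetters)

# Shares are comparable even though the groups differ in size.
print(africa_share["1-2 years"], trend_share["1-2 years"])
```

Each group's shares sum to 1, so a bar for "1-2 years" means the same thing in both groups regardless of how many people answered.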

In the trendsetter countries, the dominant years-of-coding group is most influenced by the data scientists with Master’s degrees, whereas in the African countries it is most influenced by the data scientists with lower-level degrees (Bachelor’s degrees).

Machine learning experience of Kaggle data scientists in Africa vs the Trendsetter countries by college degree
Proportional view of machine learning experience by Kaggle data scientists in Africa vs the Trendsetter countries

Interestingly, in terms of years of machine learning (ML) experience, Kaggle data scientists in Africa show a similar pattern to the trendsetter countries: most data scientists have less than 2 years of ML experience. Surprisingly, both in Africa and in the trendsetter countries, there are some data scientists who do not even use machine learning in their jobs.

The data scientists with ML experience in the trendsetter countries tend to have higher-level degrees (Master’s and doctorates) compared to those in Africa (Bachelor’s).

The data scientists who have no ML experience:

Top 3 used and suggested programming languages by Kaggle data scientists in Africa vs the Trendsetter countries

The exact same pattern holds globally: Python is the most used programming language in data science workflows, followed by SQL for storing and accessing structured data. R is still heavily used by some data scientists as an alternative to Python.

With regard to programming languages used and suggested, the pattern is the same across the African countries. Python is by far the most used programming language by the data scientists, followed by SQL and R. It’s worth mentioning the heavy use of C, C++ and Java in some African countries; these are most probably used in the data engineering and production phases of the data science workflow. Python, then SQL, then R seems like the order of priority that budding data scientists should follow in their learning journey.

It’s not surprising that there is a huge number of data scientists that form part of the lowest compensation tier — given data science’s recent rise (in Africa) as an attractive job prospect for new employees (including graduates and professionals switching to the data science field).

At entry level, the pattern of compensation in Africa is the same as in the trendsetter countries. The massive difference starts showing at the mid and top tiers of the data science job market. On average, the Kaggle data scientists in Africa earn around $15k per annum, versus the trendsetters’ $83k per annum (roughly 5.5 times as much, going by these rounded figures). Africa’s biggest earners get around $124k per annum, versus the trendsetters’ > $500k per annum (at least 4 times as much).
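One way to check the size of these gaps from the rounded figures quoted above is to compute both "times as much" and "percent more", since the two are easy to mix up. This is a small arithmetic sketch; the helper names are hypothetical, and the top-tier figure uses $500k as a lower bound because that bucket is open-ended.

```python
def times_as_much(higher, lower):
    """How many times larger the higher figure is than the lower one."""
    return higher / lower


def percent_more(higher, lower):
    """The gap expressed as a percentage of the lower figure."""
    return (higher - lower) / lower * 100


# Average compensation: about $83k (trendsetters) vs about $15k (Africa).
avg_ratio = times_as_much(83_000, 15_000)
avg_pct = percent_more(83_000, 15_000)

# Top earners: more than $500k (lower bound) vs about $124k.
top_ratio = times_as_much(500_000, 124_000)
top_pct = percent_more(500_000, 124_000)

print(f"average: {avg_ratio:.1f}x, or {avg_pct:.0f}% more")
print(f"top tier: at least {top_ratio:.1f}x, or {top_pct:.0f}% more")
```

A ratio of about 5.5 corresponds to roughly 450% more, not 550%, which is why it helps to state which of the two measures is being quoted.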

Unsurprisingly, amongst the trendsetter countries, it is the data scientists in the USA that tend to earn the big bucks.

Unsurprisingly, there appears to be a positive correlation between the number of years that a data scientist has been coding and their salary. Since there is a small number of data scientists who have been in the field for a long time, their low supply increases market competition to acquire their services.
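Survey answers for both experience and pay come as range labels rather than numbers, so a comparison like this needs the labels turned into values first. Below is a minimal sketch of one common approach: take the midpoint of each pay bucket and then the median per experience group. The labels and rows are made up for illustration (they are not survey results), and `bucket_midpoint` is a hypothetical helper.

```python
import re
from collections import defaultdict
from statistics import median


def bucket_midpoint(label):
    """Turn a range label such as '$10,000-14,999' into its midpoint."""
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", label)]
    if len(numbers) == 2:
        return sum(numbers) / 2
    return float(numbers[0])  # open-ended bucket such as '> $500,000'


# Made-up (experience, pay) pairs, only to illustrate the calculation.
rows = [
    ("1-2 years", "$1,000-1,999"),
    ("1-2 years", "$5,000-7,499"),
    ("5-10 years", "$30,000-39,999"),
    ("5-10 years", "$50,000-59,999"),
]

by_experience = defaultdict(list)
for years, pay in rows:
    by_experience[years].append(bucket_midpoint(pay))

median_pay = {years: median(values) for years, values in by_experience.items()}
print(median_pay)
```

Midpoints are only an approximation, and the open-ended top bucket is treated as its lower bound here, so the highest earners are understated.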

Similarly, there appears to be a positive correlation between the number of years that a data scientist has been using machine learning and their salary. Compensation seems to peak at around $80k for data scientists with 5–10 years of ML experience. For those with more than 10 years of ML experience, compensation peaks at around $60k. The trend seems to suggest that having ML experience beyond 10 years doesn’t earn you more money, which is interesting. Traditionally, academia has more experienced people with less compensation, whereas corporate tends to have less experienced people who are paid handsomely. One could speculate that these dynamics help explain the trend.

Having a tertiary qualification potentially increases the salary of the Kaggle data scientists in Africa. The highest earners have a Master’s degree. However, having a doctorate does not necessarily improve the earning ability of the data scientists. The same academia vs corporate speculation can be applied here.

Company and data science team sizes of Kaggle data scientists in Africa vs Trendsetter countries

The Kaggle data scientists in Africa tend to work for small companies and in small teams. This may mean that most data scientists in Africa need a broad skillset to build and operate the entire data science ecosystem (skills mostly associated with software engineers and data engineers), and are likely to spend less time developing ML models.

Number of data scientists by team size per given company size

Of particular note are the small companies (0–49 employees) that have dramatically large data science teams. These are likely start-ups that are heavily focused on selling data-related services, in line with the current high attractiveness of the data science field.

Even though the data scientists in Africa tend to be hired more by small companies and into small data science teams, it is the data scientists in big companies and big data science teams who tend to earn dramatically more.

The Kaggle data scientists in Africa spend most of their time on building and maintaining data infrastructure, and on using the data to do analyses that feed business insights. This is in line with the previous realization that most of the data scientists work in small companies that also have small data science teams. Aspiring data scientists should therefore prepare themselves accordingly, and also see this as an opportunity to engage with the entire data science ecosystem. Furthermore, for those wanting to get their hands dirty on a variety of responsibilities, working for a smaller company may be a good option.

Matplotlib and seaborn are the most used plotting libraries in Africa and the trendsetter countries, closely followed by Plotly, in line with the high usage of Python over R. These 3 plotting libraries are likely the ones that aspiring data scientists should focus on.

Scikit-learn is still the leading framework for machine learning, followed by Keras and TensorFlow, which are used more for deep learning. These 3 machine learning frameworks are likely the ones that aspiring data scientists should focus on.

Relatively simple as they are, linear and logistic regression are still the most used ML algorithms, closely followed by decision trees and random forests. Together with gradient boosting machines, these 3 families of ML algorithms are likely the ones that aspiring data scientists should focus on.

GitHub and Kaggle are the most used platforms where data scientists publicly share their work. These are therefore the platforms that aspiring data scientists should use not only to find and learn from other data scientists’ work, but also to eventually start sharing their own work to showcase their experience.

Online learning platforms used by data scientists in Africa vs Trendsetter countries

Coursera and Udemy are the go-to platforms for online learning for data scientists in Africa and the trendsetter countries. Whereas the third most used platform in Africa is DataCamp, in the trendsetter countries it is University Courses, which actually result in university degrees. This points to universities starting to play catch-up in accommodating the booming data science field.

For a fast growing field such as data science, staying up-to-date with the leading technologies, processes and thinking is paramount. The commonly used media by the data scientists are Blogs, Kaggle and YouTube.

Besides testing and showcasing EDA skills, this project is also meant to spark some conversation around ML and Data Science in Africa. The comparison with trendsetting countries is aimed at identifying gaps that we can learn from and have productive conversations about.

So talk to us :)
