The Absolute Beginner’s Guide for Data Science Rookies

How to dive into the data ocean and not drown trying
So you've probably heard about “data science” in some random coffee-shop conversation, or read about “data-driven companies” in an article you found while scrolling through your favourite social network at 3 AM, and thought to yourself, “What’s all this fuss about, huh?!”. After some investigation you end up seeing flamboyant quotes like “data is the new oil” or “AI is the new electricity”, and you start to understand why data science is so hot right now; learning it seems like the only reasonable choice.


Luckily for you, there’s no need for a fancy degree to become a data scientist: you can learn everything from the comfort of your home. The 21st century has established online learning as a reliable way to acquire expertise in a wide variety of topics, and data science is so popular right now that the sources to learn it are endless and ever-growing, which flips the problem on its head: with all these possibilities, which one should you choose? Well, my friend, I hope my experience over the last couple of months can help clear your doubts. Let’s start, shall we?



1. What to learn?


A data scientist is the player who can start the play from their own goal, dribble past a couple of defenders, deliver a precise cross to the penalty spot and head the ball into the net for a gorgeous goal. Sorry for the football reference, I can’t help it; I just wanted to picture how you’ll master a diverse set of skills that will make you useful in almost any data-related problem.


Now, I’ll divide these requirements into two approaches. First, from a technical point of view, we’ll review the foundations, that is, the fields data science relies on in one way or another. Second, from a more practical point of view, I’ll show you which programming libraries you should focus on in order to get your hands on real data projects.


1.1 Data Science Foundations


  • Programming 💻: Your first task will be to choose whether you’ll use Python or R (I’ll leave you some help here, here and here) and then immerse yourself in coding.


  • Linear Algebra 📐: As you’ll be working with data, you’ll want to know how to represent data sets as matrices, and understand concepts like vectorization and orthogonality (see the sketch after this list).


  • Calculus 🔗: Many of the models you’ll build rely on tools like derivatives, integrals and optimization to find a solution to your problem more rapidly.


  • Probability 🎲: In data science you’ll often be trying to predict something about the future, so you’ll want to know how likely an event is to happen or whether two events are related.


  • Statistics 📊: To describe the information you’ll be analysing, things like the mean or percentiles will come in handy, and hypothesis tests will appear along the way.


  • Machine Learning 🤖: Arguably the core of data science: at some point during your project you’ll want to predict something, and that’s when machine learning kicks in.
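
To make a couple of these ideas concrete, here’s a minimal sketch (using NumPy, with made-up numbers) of representing a data set as a matrix, vectorizing an operation, and computing some basic descriptive statistics:

```python
import numpy as np

# A tiny "data set": 4 observations (rows) x 3 features (columns)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])

# Vectorization: scale every value at once, no explicit Python loop
X_scaled = X / X.max()

# Descriptive statistics: column means and the 75th percentile
print(X.mean(axis=0))        # mean of each feature
print(np.percentile(X, 75))  # 75th percentile of all values
```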


1.2 Data Science Libraries


As you’ll discover after some time coding on your own, every programming language comes with a series of packages or libraries that provide functions and methods to perform diverse tasks more easily.

Here you’ll find a table with the most popular and useful libraries for Data Science in Python with some brief guidelines below. In case you’ve gone the R way, don’t worry, I’ll also leave you a very good article with a similar table for R libraries here.



| Library | Category | Description | Website |
| --- | --- | --- | --- |
| NumPy | Starter Kit | Provides the base for large multi-dimensional array/matrix analysis with the ndarray object and several functions/methods. | http://www.numpy.org/ |
| SciPy | Starter Kit | Has statistics, optimization, integration and linear algebra packages for numerical computations. | https://www.scipy.org/ |
| Pandas | Starter Kit | Provides high-level data wrangling with the DataFrame and Series objects, which help with filtering, grouping and other data tasks. | https://pandas.pydata.org/ |
| Matplotlib | Starter Kit | Low-level plotting library for creating two-dimensional diagrams and graphs like histograms and scatter plots. | https://matplotlib.org/ |
| Scikit-learn | Starter Kit | Provides algorithms for many standard machine learning tasks such as clustering, regression, classification, model selection, etc. | https://scikit-learn.org/ |
| TensorFlow | Deep Learning | The most popular deep learning framework, created by Google Brain; allows the creation of all sorts of artificial neural networks. | https://www.tensorflow.org/ |
| Keras | Deep Learning | A high-level library for working with ANNs, running on top of TensorFlow or Theano; allows fast prototyping and implementation. | https://keras.io/ |
| PyTorch | Deep Learning | Large DL framework that lets you perform tensor computations with GPU acceleration and create dynamic computational graphs. | https://pytorch.org/ |
| Bokeh | Data Visualisation | Creates interactive and scalable visualisations in a browser, providing a vast collection of graphs and styling/interaction possibilities. | https://bokeh.pydata.org/ |
| Seaborn | Data Visualisation | Higher-level API based on Matplotlib; lets you create more complex and better-looking plots like time series and joint plots. | https://seaborn.pydata.org/ |
| Plotly | Data Visualisation | Web-based framework that lets you build sophisticated graphics easily; it is also adapted to work in interactive web applications. | https://plot.ly/ |
| NLTK | NLP | A whole platform for natural language processing; it can process and analyse text in a variety of ways, tokenize and tag it, extract information, etc. | https://www.nltk.org/ |
| SpaCy | NLP | Another natural language processing library, with excellent examples, API documentation and demo applications. | https://spacy.io/ |
| Scrapy | Web Scraping | Complete web scraping framework that can create spider bots to crawl and retrieve data from the web and through APIs. | https://scrapy.org/ |
| BeautifulSoup | Web Scraping | Parsing library that can use different parsers; a parser is simply a program that can extract data from HTML and XML documents. | https://www.crummy.com/software/BeautifulSoup/bs4/doc/ |
| StatsModels | Others | A module equipped with all you need to perform simple and advanced statistical analysis, such as model estimation and tests. | https://www.statsmodels.org/ |
| XGBoost | Others | Gradient boosting framework for machine learning, which uses ensemble learning to obtain better predictive performance. | https://xgboost.readthedocs.io/ |

Top Python Libraries for Data Science



The Starter Kit is all you need to start doing data science. NumPy provides the base for working with data, but you’ll handle it more easily with Pandas. SciPy provides some fancy functions and methods to perform advanced calculations on top of the NumPy framework, and Matplotlib will let you plot your findings. Finally, Scikit-learn is the starting point for machine learning: it contains everything you need to apply all the classical regression, classification and clustering methods.
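
To give you a feel for how these pieces fit together, here’s a minimal sketch (with a hypothetical “customers.csv” file and made-up column names) that loads data with Pandas, plots a histogram via Matplotlib and fits a classifier with Scikit-learn:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data set: an "age" feature and a 0/1 "churned" label
df = pd.read_csv("customers.csv")

# Quick look at one feature's distribution
df["age"].hist(bins=20)
plt.show()

# Split the data and fit a simple classifier
X_train, X_test, y_train, y_test = train_test_split(
    df[["age"]], df["churned"], test_size=0.2, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out data
```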


On the other hand, Deep Learning frameworks will help you build artificial neural networks to perform more complex machine learning tasks like image recognition. Then there are other Data Visualisation alternatives which let you create more stylized and interactive plots, even in web applications. Natural Language Processing (NLP) is a very popular field within data science; it’s what allows Alexa or Siri, for example, to understand what you’re saying. When looking for data for your analysis, the Internet is an unlimited source, so Web Scraping tools will come in handy to collect and retrieve that data on a frequent basis. Last but not least, StatsModels (statistical analysis) and XGBoost (gradient boosting) will help you with some more specific tasks.
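
As a taste of what a deep learning framework looks like, here’s a minimal sketch of a small neural network in Keras (the data, layer sizes and shapes are made up purely for illustration):

```python
import numpy as np
from tensorflow import keras

# Made-up data: 100 samples with 4 features each, 3 possible classes
X = np.random.rand(100, 4)
y = np.random.randint(0, 3, size=100)

# A tiny feed-forward neural network: one hidden layer, softmax output
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(4,)),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)
```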

1.3 Data Science (not so) Bonus Tracks


So far we’ve talked about working with data, but what type of data? Well, it can range from pretty, clean and structured CSV files to gigantic datasets with millions of examples of very unstructured data. Yes buddy, we’re not in Kansas anymore.

Depending on where you end up working, your company might handle data in different ways, but it will most certainly handle very large data sets, so the two tools coming up next are a definite must for every data scientist.

  • Databases: Big companies store their data in databases. Why not a spreadsheet? Well, basically, databases are a far more appropriate way to handle large chunks of data while ensuring data integrity and security, and they also allow easy querying and updating. Now, the thing is there are two types of databases, relational DBs (SQL) and non-relational DBs (NoSQL); you might want to learn the differences (this post may help you) and hopefully learn to work with both (see the querying sketch after this list).


  • Big Data: When it comes to working with a lot of data, you also have to think about how you’ll process and refine that amount of information. When you have thousands of rows you may need just a few seconds to perform simple tasks, but when you need to run a highly complex model on millions of records you’ll probably need days. For this purpose you use parallel and distributed computing, which performs multiple tasks simultaneously on different cores or CPUs. The two most popular frameworks are Hadoop MapReduce and Spark (read about their differences here); see the second sketch below.
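
To illustrate the querying idea from the Databases point above, here’s a minimal sketch using Python’s built-in sqlite3 module (the database file, table and columns are all hypothetical):

```python
import sqlite3

# Hypothetical database with an "orders(customer_id, amount)" table
conn = sqlite3.connect("shop.db")
cur = conn.cursor()

# SQL: the top 10 customers by total spend
cur.execute("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
""")
for customer_id, total_spent in cur.fetchall():
    print(customer_id, total_spent)
conn.close()
```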
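
And for the Big Data point, here’s a minimal sketch of a similar aggregation in Spark through its Python API, PySpark (the file and column names are again hypothetical); the point is that Spark distributes the work across cores or machines for you:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Hypothetical data set: "events.csv" with a "user_id" column
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# The same group-and-aggregate pattern, executed in parallel
df.groupBy("user_id").agg(F.count("*").alias("n_events")).show()

spark.stop()
```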

3. What’s next?


We’ve come a long way already, learning what and where to learn everything you need to become a data scientist, but what do you do now? I’ll show you how to put your skills into practice the right way, so you’re in tune with the data science community.


  • Jupyter 📓: The scientific world is a bit messy: tons of maths, graphs and code may make it seem like you’re doing some really avant-garde stuff, but if you can’t show it clearly then it’s worthless. That’s why the awesome Jupyter Notebook appeared to help you. It lets you run your code, draw amazing interactive plots, import any kind of data, write beautiful equations and comment everything appropriately through Markdown in a well-presented notebook, so you can show your work without it being an absolutely indecipherable mess. Here you can see an example of a Jupyter Notebook to understand what I’m talking about.


  • Kaggle 🏆: Competitive data science may sound like the major leagues, but Kaggle (acquired by Google in 2017) is an awesome data science platform and community that allows companies and organisations to create competitions (sometimes it’s a problem they actually want solved) where everyone can participate and even win juicy prizes, with live rankings showing the best results so far. Besides, it has a lot of learning material to get you started, and a huge repository with datasets of all kinds and for different purposes, so basically you have everything you need to start working on real projects!


  • GitHub 🐱: Within software development, something called Git was created: a version control system that basically helps keep track of changes in a project where multiple people work. After this came GitHub, a web-based hosting service for version control which, in very simple words, keeps projects in the cloud organised and logged. On GitHub you’ll find all sorts of awesome repos you can fork and modify for your personal projects, but most importantly, GitHub will help you keep an organised portfolio to showcase your work when looking for a job.


  • Stack Overflow ❓: The coding road is not exempt from bumps; in fact, handling all the errors and debugging hundreds of lines of code can become very annoying. But, as part of human nature, we learn from our mistakes and learn not to trip over the same stone twice. Stack Overflow is a Q&A community of developers from all over the world that lets you find explanations for common (and not so common) programming problems, through threads where everyone can ask and answer easily. Remember this when you’re desperately Ctrl+C/Ctrl+V-ing an excerpt of your code into Google and it redirects you to your glorious answer on Stack Overflow. Just magic.


Some final thoughts


I strongly believe a good professional is always learning. In a world as dynamic as ours you can become obsolete in no time, and that’s where the power of learning resides: being able to keep up to date and reinvent yourself will help you grow not only as a pro but also as a person.


Finally, learning is a very personal process. There isn’t (yet) a magic formula that works for everyone, and even though throughout this post I’ve tried to present multiple options in the form of suggestions, the final call is up to you. Just make sure you enjoy the ride 🏄



Thank you!


by

Shouvik Bajpayee

Pursuing Master's in Distributed and Mobile Computing

B.Tech in Computer Science and Engineering

