“Data are becoming the new raw material of business.”
– The Economist

Spark 2.0 on Jupyter with Toree


Spark is one of the most popular open-source distributed computation engines, offering a scalable, flexible framework for processing huge amounts of data efficiently. The 2.0 release milestone brought a number of significant improvements, including Datasets (an improved version of DataFrames), expanded support for SparkR, and much more. One of the great things about Spark is that it’s relatively self-contained and doesn’t require a lot of extra infrastructure to work. While Spark’s latest release is 2.1.0 at the time of writing, we’ll use version 2.0.1 throughout this post.


Jupyter notebooks offer an interactive way to write code that enables rapid prototyping and exploration. A notebook essentially connects a browser-based frontend, the Jupyter server, to an interactive REPL underneath that processes snippets of code. The advantage to the user is being able to write code in small chunks that run independently but share the same namespace, which greatly facilitates testing or trying multiple approaches in a modular fashion. The platform supports a number of kernels (the components that actually run the code) beyond the out-of-the-box Python one, but connecting Jupyter to Spark is a little trickier. Enter Apache Toree, a project meant to solve this problem by acting as a middleman between a running Spark cluster and other applications.

In this post I’ll describe how we go from a clean Ubuntu installation to being able to run Spark 2.0 code on Jupyter. Continue reading
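In practice, the setup mostly comes down to wiring Toree to an existing Spark installation. A rough sketch of the steps (the Spark path /opt/spark and the use of pip are assumptions; adjust them for your own setup):

```shell
# Install Jupyter and the Toree kernel package
pip install jupyter toree

# Register Toree as a Jupyter kernel, pointing it at the Spark installation
jupyter toree install --spark_home=/opt/spark --user

# Start the notebook server; a Toree (Scala) kernel should now be selectable
jupyter notebook
```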

5 secrets for writing the perfect data scientist resume

Data scientists are in demand like never before, but nonetheless, getting a job as a data scientist requires a resume that shows off your skills. At The Data Incubator, we’ve received tens of thousands of resumes from applicants for our free Data Science Fellowship. While we work hard to read between the lines to find great candidates who happen to have lackluster CVs, many recruiters may not be as diligent. Based on our experience, here’s the advice we give to our Fellows about how to craft the perfect resume to get hired as a data scientist.

Be brief: A resume is a summary of your accomplishments. It is not the right place to put your little-league participation award. Remember, you are being judged on something a lot closer to the average of your listed accomplishments than their sum. Giving unnecessary information will only dilute your average. Keep your resume to no more than one page. Remember that a busy HR person will scan your resume for 10 seconds. Adding more content will only distract them from finding key information (as will that second page). That said, don’t play font games; keep text at 11-point font or above. Continue reading

Mindset Shift: Transitioning from Academia to Industry

Special thanks to Francesco Mosconi for contributing this post.


Transitioning from Academia to Industry can be difficult for a number of reasons:

  1. You are learning a new set of hard skills (data analysis, programming in Python, machine learning, MapReduce, etc.), and you are doing so in a very short time.
  2. You are also learning new soft skills, which require practice to develop.
  3. A mindset shift needs to occur, and your success in industry will strongly depend on how quickly this happens.


Learn to prioritize

When your goal is knowledge, as in Academia, it is okay to spend as much time as you want learning a new concept or completing a project. During this program and on the job, by contrast, you will often find that there is not enough time to deal with all the tasks and assignments required of you. When you have more on your plate than you can handle, it is essential to develop the ability to decide which tasks require execution, which can be postponed, and which can simply be ignored. There are many frameworks and approaches to prioritization; famous examples include the Getting Things Done system and the Eisenhower Method. Most methods are good, and eventually you will find your favorite one; however, they only work if consistently applied. In other words, which prioritization method you choose matters less than actually prioritizing your day and your week according to the specific goals you need to accomplish. Continue reading

Three best practices for building successful data pipelines

On September 15th, O’Reilly Radar featured an article written by Data Incubator founder Michael Li. The article was originally posted on O’Reilly Radar.


Building a good data pipeline can be technically tricky. As a data scientist who has worked at Foursquare and Google, I can honestly say that one of our biggest headaches was locking down our Extract, Transform, and Load (ETL) process.

At The Data Incubator, our team has trained more than 100 talented Ph.D. data science fellows who are now data scientists at a wide range of companies, including Capital One, the New York Times, AIG, and Palantir. We commonly hear from Data Incubator alumni and hiring managers that one of their biggest challenges is also implementing their own ETL pipelines.

Drawing on their experiences and my own, I’ve identified three key areas that are often overlooked in data pipelines. They amount to making your analysis:

  1. Reproducible
  2. Consistent
  3. Productionizable

While these areas alone cannot guarantee good data science, getting these three technical aspects of your data pipeline right helps ensure that your data and research results are both reliable and useful to an organization. Continue reading

How to Kickstart Your Data Science Career

At The Data Incubator, we run a free six-week data science fellowship to help transition our Fellows from Academia to Industry. This post runs through some of the toolsets you’ll need to know to kickstart your data science career.

If you’re an aspiring data scientist but still processing your data in Excel, you might want to upgrade your toolset. Why? First, while advanced features like Excel pivot tables can do a lot, they don’t offer nearly the flexibility, control, and power of tools like SQL, or of their functional equivalents in Python (Pandas) or R (Dataframes). Also, Excel has low size limits, making it suitable for “small data,” not “big data.”

In this blog entry we’ll talk about SQL. This should cover your “medium data” needs, which we’ll define as the next level of data: datasets that exceed Excel’s limit of roughly one million rows. SQL stores data in tables, which you can think of as spreadsheets with more structure. Each row represents a specific record (e.g. an employee at your company), and each column of a table corresponds to an attribute (e.g. name, department id, salary). Critically, all values in a column must be of the same “type.” Here is a sample of the table Employees:

EmployeeId | Name  | StartYear | Salary | DepartmentId
-----------|-------|-----------|--------|-------------
1          | Bob   | 2001      | 10.5   | 10
2          | Sally | 2004      | 20.0   | 10
3          | Alice | 2005      | 25.0   | 20
4          | Fred  | 2004      | 12.5   | 20

SQL has many keywords composing its query language, but the ones most relevant to data scientists are SELECT, WHERE, GROUP BY, and JOIN. We’ll go through each of these individually.
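These keywords are easy to try at home. Below is a minimal sketch using Python’s built-in sqlite3 module to recreate the Employees table above; the Departments table and its department names are hypothetical, added only to give JOIN something to join against:

```python
import sqlite3

# In-memory database holding the sample Employees table from above
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Employees (
    EmployeeId INTEGER, Name TEXT, StartYear INTEGER,
    Salary REAL, DepartmentId INTEGER)""")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [(1, "Bob", 2001, 10.5, 10),
     (2, "Sally", 2004, 20.0, 10),
     (3, "Alice", 2005, 25.0, 20),
     (4, "Fred", 2004, 12.5, 20)])

# Hypothetical Departments table, only to illustrate JOIN
conn.execute("CREATE TABLE Departments (DepartmentId INTEGER, DeptName TEXT)")
conn.executemany("INSERT INTO Departments VALUES (?, ?)",
                 [(10, "Sales"), (20, "Engineering")])

# WHERE filters rows: employees who started in 2004
started_2004 = conn.execute(
    "SELECT Name FROM Employees WHERE StartYear = 2004").fetchall()

# GROUP BY aggregates rows: average salary per department
avg_salary = conn.execute(
    "SELECT DepartmentId, AVG(Salary) FROM Employees "
    "GROUP BY DepartmentId").fetchall()

# JOIN combines tables on a shared key
joined = conn.execute(
    "SELECT e.Name, d.DeptName FROM Employees e "
    "JOIN Departments d ON e.DepartmentId = d.DepartmentId "
    "WHERE e.Name = 'Bob'").fetchall()
```

Here started_2004 contains Sally and Fred, avg_salary averages 10.5 and 20.0 to 15.25 for department 10 (and 18.75 for department 20), and joined pairs Bob with his department’s row.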
Continue reading

A CS Degree for Data Science — Part I: Efficient Numerical Computation

At The Data Incubator, we get tons of interest from PhDs looking to attend our free fellowship, which trains PhDs to join industry as quants and data scientists.  A lot of them have asked what they can do to make themselves stronger candidates.  One of the critical skills for being a data scientist is understanding computation and algorithms.  Below is a (cursory) guide meant to whet your appetite for the computational techniques that I found useful as a data scientist.  Remember, there’s a lifetime of things to learn and these are just some highlights: Continue reading