Most Important Thing

September 20, 2017

There is a tremendous amount that you *can* know about data science. Turns out that there’s much less that you *have* to know to get that first job.

However, there’s one thing that you absolutely must know, and it rarely seems to be taught in classes (online or otherwise).

And that is to know the pros and cons of each of the most common modeling techniques. When given a problem, you need to have a good sense for which technique to try first.

When I say “common modeling techniques”, I mean things like:

  • Linear regression
  • Logistic regression
  • Classification and regression trees
  • Random forest
  • Naive Bayes (yes, this is something you need to know!)
  • k-NN
  • k-means clustering
  • Hierarchical clustering
  • Support vector machines

And when I say “pros and cons”, I mean things like:

  • Strong assumptions (e.g., linearity, independence of features)
  • Computational complexity
  • Bias and variance
  • Interpretability
  • Susceptibility to overfitting

When you do on-site interviews, you are pretty much guaranteed to be asked to go up to a whiteboard and work out some problems. Very often, the interviewer will pose a typical data science problem and ask you to solve it.

Depending on the interview, they may give you very little initial guidance. And it’s unlikely they will explicitly tell you which modeling technique they want you to use.

They want you to select one and they’ll be evaluating your ability to do so.

So, here’s my advice: take out a piece of paper or create a spreadsheet and make a grid.

List each of the popular modeling techniques as rows.

List the attributes you want to compare (the pros and cons above) as columns.

Then fill in the squares for each method.

For example, for Naive Bayes you would put “independence of features” in the “Strong assumptions” column, “Low” in the “Computational complexity” column, “High bias/low variance” in the “Bias and variance” column, and so forth.
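
If you would rather keep the grid in a file than on paper, here is a minimal sketch in Python of one way to set it up. The function names (`make_grid`, `grid_to_csv`, `unfilled`) are hypothetical and chosen only for illustration. The rows and columns come from the two lists earlier in this post, and only the three Naive Bayes cells from the example above are filled in; every other cell is deliberately left blank for you to complete.

```python
import csv
import io

# Rows: the common modeling techniques listed earlier in the post.
TECHNIQUES = [
    "Linear regression",
    "Logistic regression",
    "Classification and regression trees",
    "Random forest",
    "Naive Bayes",
    "k-NN",
    "k-means clustering",
    "Hierarchical clustering",
    "Support vector machines",
]

# Columns: the pros-and-cons attributes listed earlier in the post.
ATTRIBUTES = [
    "Strong assumptions",
    "Computational complexity",
    "Bias and variance",
    "Interpretability",
    "Susceptibility to overfitting",
]


def make_grid():
    """Return an empty grid: one row per technique, one blank cell per attribute."""
    return {t: {a: "" for a in ATTRIBUTES} for t in TECHNIQUES}


def unfilled(grid):
    """List the (technique, attribute) cells that are still blank."""
    return [(t, a) for t in TECHNIQUES for a in ATTRIBUTES if not grid[t][a]]


def grid_to_csv(grid):
    """Render the grid as CSV text that any spreadsheet program can open."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Technique"] + ATTRIBUTES)
    for t in TECHNIQUES:
        writer.writerow([t] + [grid[t][a] for a in ATTRIBUTES])
    return buf.getvalue()


grid = make_grid()

# The only cells filled in here are the Naive Bayes entries from the example above.
grid["Naive Bayes"]["Strong assumptions"] = "Independence of features"
grid["Naive Bayes"]["Computational complexity"] = "Low"
grid["Naive Bayes"]["Bias and variance"] = "High bias/low variance"

csv_text = grid_to_csv(grid)
print(csv_text)
print(f"{len(unfilled(grid))} cells left to fill in")
```

Save the printed CSV to a file and open it in a spreadsheet, or keep editing the `grid` dictionary directly. The `unfilled` helper is just a convenient way to see what you still need to look up.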

If you don’t know the pros and cons of a method, just Google them or go to the relevant Wikipedia page.

Then make sure to review this grid before each interview you go on. Heck, it would probably be good to memorize it. It will still be important once you get a job.

Yes, obviously you should know the theory behind the methods. And it’s essential that you have experience implementing them and solving problems with data.

But knowing when to use each model is a good test of whether you know both the theory and the practical side of the methods.

To your success,