I'm taking two classes this semester. Stepping it up from the single-class semesters. I think it'll be ok though. I'm in Intro to Data Science and Bayesian Statistics. Man, a year ago, I was so excited to be in a DS class. And now, here I am, and it feels like bush league. It's because I've been studying this stuff so intensely for the last year, doing things the right way, no shortcuts, doing my own research, listening to DS podcasts, etc. So I'm not terribly surprised when our professor talks about reproducible research using knitr. Got it, dude. I've been rolling with R Markdown for at least a semester, and LaTeX even longer.
And despite how it sounds, I'm trying NOT to be a know-it-all. But it's hard, because all the things we're learning in this class are things I've been doing on my own in other classes. So I'm just trying to keep my mouth shut and learn a thing or two. And I have! In learning R, I've skipped over a lot of the programming language aspects of it. I read somewhere that R is meant to be learned in tandem with statistics, so that's what I've been doing. And in doing so, I've skipped a lot of the fundamental programming language stuff that I would normally pick up when learning any other language. So now's a good time to start picking that stuff up.
But to the title of this post: we've got a semester-long project we have to work on, and on a long run today, I was thinking about project ideas. Then I got to thinking about why it's so hard to come up with good DS projects. Here is my attempt to explain this, which also doubles as an argument for why data science will never be fully automated, at least until we have artificial creativity (still a long way from reality).
1) There are two high-level purposes of machine learning - classification and prediction.
So far so good. I think we can all agree on this. We're either trying to put things in buckets or guess what the next value is going to be. Of course it gets a lot more complicated when you dive in, but at the high level, that's it.
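To make the buckets-versus-next-value split concrete, here's a toy sketch in Python (not R, and not anything from the class). The function names `classify` and `predict_next` and all the numbers are made up for illustration; real work would lean on a proper library rather than these hand-rolled helpers.

```python
from statistics import mean


def classify(value, labeled_examples):
    """Classification: put `value` in the bucket whose average is closest."""
    centroids = {label: mean(vals) for label, vals in labeled_examples.items()}
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def predict_next(series):
    """Prediction: fit a least-squares line to the series and extend it one step."""
    n = len(series)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(series)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


# Hypothetical data, purely for illustration.
home_runs = {"slugger": [38, 41, 45], "contact hitter": [6, 9, 12]}
bucket = classify(30, home_runs)      # which bucket does a 30-homer season fall in?

weights = [180.0, 182.0, 184.0, 186.0]
next_weight = predict_next(weights)   # guess the next value in the series

print(bucket, next_weight)
```

With these made-up numbers, the 30-homer season lands in the "slugger" bucket and the next weight comes out to 188.0. Same flavor of math underneath, but one answers "which bucket?" and the other answers "what comes next?"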
2) There are 5 types of predictions we can make (I came up with these off the top of my head while on a run, so don't take this as gospel).
2a) Natural phenomena - weather, stock market, global economies.
2b) Human phenomena - baseball players hitting home runs, football teams winning the Super Bowl, how much I will weigh next year, whether you will click this ad, whether you will buy this product
2c) Social phenomena - presidential elections, who you might want to follow on Twitter, data scientist wages
2d) Games of chance - Poker, Blackjack, dice, flipping coins
2e) Games of strategy - Literally any board game with multiple players
(There are almost certainly others, but this is seriously all I can think of and it makes sense to me)
Furthermore, there are three stages to data science (ONCE AGAIN, MAKING THIS UP AS I GO):
I) Asking questions, designing the problem
II) Modeling the problem, delivering results
III) Making decisions on those results
I am coming to the conclusion that data science doesn't spend enough time on I and III. It focuses so much on II, the technical side, that it forgets to even teach us to ask questions or tell anyone why the model is useful. And maybe in many scenarios, the business teams, managers, VPs and CEOs perform I and/or III. But I am somehow in a position where I need to prove why DS is useful. All the technical knowledge in the world ain't gonna make that happen. I need to be able to apply it. So really, I and III are somewhat about advertising ourselves. Machine learning without purpose is just a cool parlor trick. But use it to answer a really good question and then perform some action to make life better, and you're a superhero.
So anyway, those were my key insights today as I hauled in 8 miles. If I'm way off on this shit, let me know in the comments.