Hacker News
Machine learning for financial prediction (robotwealth.com)
214 points by Matetricks on June 12, 2016 | 82 comments


It's just so ridiculously easy to overfit these models, and there are so many ways to shoot yourself in the foot as a result.

For example, "I split the data set into 5 random segments and then trained a model on 4 of the 5 segments and then tested it on 5th." Such data is serially correlated (it's not good old iid) so already it looks like you have poisoned the test set with information from the training set.
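
A minimal sketch of why a random split leaks, assuming nothing about the article's actual data: it only counts how many held-out points sit right next to a training point in time. The function names and the 1000-point series length are made up for illustration.

```python
import random


def random_kfold_test_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and return the first of k equal folds as the test set."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return set(idx[: n // k])


def fraction_with_training_neighbour(test, n):
    """Share of test points whose previous or next time step is a training point."""
    leaky = 0
    for i in test:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
        if any(j not in test for j in neighbours):
            leaky += 1
    return leaky / len(test)


n = 1000  # hypothetical series length
random_test = random_kfold_test_indices(n)
block_test = set(range(n - n // 5, n))  # last 20% held out as one contiguous block

print("random split:", fraction_with_training_neighbour(random_test, n))
print("block split: ", fraction_with_training_neighbour(block_test, n))
```

With a random split nearly every test point is adjacent to a training point, so any serial correlation carries information across the boundary. With one contiguous held-out block only the single point at the boundary is.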

The hard part is not "feature engineering" or "ensemble methods", the hard part is controlling the entropy that you feed these things because they are voracious monsters and will absolutely eat all of it.


> Such data is serially correlated (it's not good old iid) so already it looks like you have poisoned the test set with information from the training set.

Kind of. If it was that simple making money off of an autoregressive model would be trivial -> everyone would do it -> serial correlation would disappear.

I agree with your observation that figuring out what to feed the beast is one of the bigger challenges though. Case in point: train a mean reversion model on the last seven years of S&P data to buy dips and train a momentum model to buy higher highs. That equity curve would look very encouraging. Do it on a fifteen year basis, and not so much. Now the question becomes: how long of a lookback do you use when training your models? Chopping up data at random will mux out useful correlations. Subsetting into periods leads to poorly generalized models. Not fun.


It's so easy to do machine learning and think you're a genius when you are in fact overfitting. It's almost like casino gambling. You tweak some hyperparameter, pull the slot machine lever, and wham, your model says you should be rich real soon...


This is one of the better responses. Issues that arise: low # of data points at macro timescale, time series data (and local correlation between individual data points) making it hard to extract training/testing sets, and the overarching structural shifts in the market over time that invalidate older data (depending on context).


Doesn't that cut both ways though? If there are serial correlations in the data then modeling and accounting for the variance explained by those correlations should help with future predictions, no?


If anyone else is getting errors when loading the page, here's the google cached version http://webcache.googleusercontent.com/search?q=cache:-ciyXfS...



+1 thanks for the cached version.


Thank you.


Thank You !!!


There are a few problems with turning your laptop into a money machine using data analysis.

Remember the maxim: past performance is not a guarantee of future results. You can develop strategies based on past data that will beat the market, but the nature of markets is to adapt to kill your edge. Markets adapt constantly and your edge stops working at an unknown point in time. It's unknowable when that WILL happen because past data can't show that.

The other reason is transaction costs, which in gambling are called the vig. Let's say I'm betting NFL games. NFL home teams win 51% of games. Even a flipped coin, I've read, comes up heads 50.1% of the time. These are profitable systems. But you're paying the bookie 10% on each loss. You could find someone to bet you on coin tosses and bet heads each time. You have a positive expected return, although you need a huge number of flips to make money!
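
To put numbers on the vig, here is a small sketch. It assumes an even-money bet where a win pays the stake and a loss costs the stake plus 10%, which is one reading of "paying the bookie 10% on each loss"; the function names are illustrative.

```python
def expected_profit_per_bet(p_win, stake=1.0, vig=0.10):
    """Expected profit when a win pays `stake` and a loss costs `stake * (1 + vig)`."""
    return p_win * stake - (1 - p_win) * stake * (1 + vig)


def breakeven_win_rate(vig=0.10):
    """Win rate at which the expected profit is exactly zero."""
    return (1 + vig) / (2 + vig)


print(expected_profit_per_bet(0.51))  # the 51% home-team edge, after the vig
print(breakeven_win_rate())           # win rate needed just to break even
```

Under these assumptions the 51% edge loses about 2.9 cents per dollar staked, and you need to win roughly 52.4% of the time before the bet pays at all.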

In trading, of course, the costs are commissions. Why do you think there was a rise in HFT? The strategies are consistently profitable (besides the flashing/manipulation tactics). They are ONLY profitable because of extremely low commission costs that are not available to the retail (or even semi-professional) trader.

Systems that can pull $0.0001 out of every share traded overall on high volume can be (pretty easily) created, but you can't trade them profitably. In fact, you will find commissions (semi-pros who pay about $3 per 1000 shares) priced right at the point of an edge you could be expected to develop.
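
The arithmetic behind that, as a sketch: it assumes the $3 per 1000 shares is charged once on every share the system trades, and the one-million-share volume is an arbitrary example.

```python
def net_profit(shares, edge_per_share, commission_per_1000):
    """Gross edge minus commissions for a given traded volume."""
    gross = shares * edge_per_share
    costs = shares * commission_per_1000 / 1000
    return gross - costs


def breakeven_commission_per_1000(edge_per_share):
    """Highest commission per 1000 shares at which the edge still covers costs."""
    return edge_per_share * 1000


print(net_profit(1_000_000, 0.0001, 3.0))       # edge of $0.0001/share at $3 per 1000
print(breakeven_commission_per_1000(0.0001))    # commission the edge could survive
```

On a million shares the edge earns $100 and the commissions cost $3,000. The same edge only breaks even at about $0.10 per 1000 shares, a thirtieth of the semi-pro rate.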


"nature of markets is to adapt to kill your edge"

If you are a low volume, small time trader, the market isn't going to move as quickly to adapt to you. If you have $100,000, for example, and return 30% a year, you aren't on anyone's radar.


Provided you're the only one trading your strategy, which is unlikely.


There are trading firms that have made fortunes letting the algorithms do the trading:

http://www.fool.com/investing/general/2014/02/09/this-man-ma...

https://en.m.wikipedia.org/wiki/Renaissance_Technologies

Yes, you have to adapt but these guys play with billions.


Ren Tech also has 200+ math and signal processing PhDs on its payroll. I'm not sure how well the individual trader does (maybe they do okay, but I doubt I could avoid being eaten by the big sharks in the market).


Yes, everyone's heard of RenTec, particularly those of us in the industry. Their existence doesn't seem germane to my point.

You were claiming that a little guy's edge won't be rapidly eroded by the market. This is only true if his trading strategy is uncorrelated to every other little guy's trading strategy. But this is unlikely. The space of unsophisticated strategies is not all that large.


So what are those hundreds of people doing at RenTec if "the space of unsophisticated strategies is not all that large"? Do you think their $100 million algorithmic trades go unnoticed? A $10,000 trade is going to make someone take notice of a little guy?

At any rate, is there some reason that you are driving the discussion to "it's not possible"? No one learns anything by saying "it's impossible". Perhaps we should have ignored the post.


They're developing sophisticated strategies. Duh.


ok, what's the difference between sophisticated and unsophisticated algorithmic trading?


Poster below is kind of right that you're never going to get an in depth answer for free...

But a few key points that separate the two:

1. It's very easy to make 50% a year on a few hundred thousand. If you can't even do that, it's not worth even bothering to compete. It's VERY hard to do the same with a few hundred million or worse, a billion.

2. With regard to #1, the key difference is market impact. When you start trading a non-trivial percentage of a symbol's average daily volume (e.g., 10%+) you start having effects on the price. A dumb strategy would be to just place market orders for the full amount. Someone will just place a cascading set of limit orders that you'll hit as soon as you wipe out existing liquidity on the book. A slightly better strategy might hide the total order. A much better strategy will place thousands of small orders of random sizes at different times across different exchanges to simulate organic market activity and attract liquidity. This order sizing is probably based on both predictive models and analysis of the full exchange feeds (which are both very monetarily and computationally expensive to use).

3. Sophisticated algorithmic trading will either try to get the market to do something (e.g., place orders in such a way as to elicit a reaction from the market) or use non-market data in combination with market data to make decisions. These approaches add external entropy and allow for more theoretical alpha than reacting to lagged market signals.
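
As a toy illustration of the "small orders of random sizes at different times" idea in point 2, and nothing more: the sketch below just cuts a parent order into random child sizes with random timestamps. The function name and parameters are hypothetical, and a real execution algorithm would size and time orders from models and exchange data rather than a uniform random draw.

```python
import random


def slice_order(total_shares, n_children, horizon_seconds, seed=None):
    """Split a parent order into n_children (time, size) child orders.

    Sizes are positive integers that sum to total_shares; times are
    uniformly random offsets within the horizon, returned in time order.
    """
    rng = random.Random(seed)
    # n_children - 1 distinct cut points inside 1..total_shares-1
    cuts = sorted(rng.sample(range(1, total_shares), n_children - 1))
    bounds = [0] + cuts + [total_shares]
    sizes = [b - a for a, b in zip(bounds, bounds[1:])]
    times = sorted(rng.uniform(0, horizon_seconds) for _ in sizes)
    return list(zip(times, sizes))


children = slice_order(total_shares=100_000, n_children=50, horizon_seconds=6.5 * 3600, seed=1)
for t, size in children[:5]:
    print(f"t={t:8.1f}s  size={size}")
```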


That's not free advice. Good luck.


30% on 100k is 30k. You'd be better off getting a regular job unless you can sustain that for more than 10 years. Which you can't predict.


No one said that this was your full-time job. No one said that it has to be HFT either. Algorithms identify good trades at 4:01pm and you buy the next day. And no one is saying that you have to trade every day.


Exactly. Plus let's remember you are adding nothing to life here. All you are doing is collecting 30k off other people with a slightly less optimal "strategy" than you.

And no I don't believe these people are "adding liquidity and assisting price discovery".

Edit, in response to the reply (I can't post one, as HN censors detractors of big finance):

I don't believe the benefits of liquidity added by HFT are worth the enormous costs firms sink into it.


>> And no I don't believe these people are "adding liquidity and assisting price discovery".

The nice thing about reality is that it remains even when your belief persists against it.

In large-cap stocks during rising markets, high frequency trading does improve liquidity[1]. While the effect may not be as prevalent during a downturn and it may not impact smaller stocks as much, I'd like to see your evidence that it actively harms the market or that the practice is vapid and produces nothing of value.

Or is that just a statement you made because it nicely aligns with your political conception of Wall St?

EDIT: It helps to point out that "algorithmic trading" and "high frequency trading" are not at all the same thing, especially as these terms are usually conflated on HN. An algorithmic trading system does not necessarily need to trade at high frequency. Some algorithmic trading systems make trades in intervals of days or weeks, not seconds or milliseconds. The paper cited here describes the market-making activities of what is traditionally called high frequency trading and the benefits it has over human brokers of the past, but it uses the umbrella term "algorithmic trading."

EDIT 2: The parent comment responded to this one by editing his original one, because "HN censors detractors of big finance." You also claimed you don't believe that the liquidity provided by HFT is worth the capital that large firms dump into it.

In 2013 the entire HFT industry made about $1B, down from $5B in 2009[2]. HFT is not a large industry. It is eating much of Wall Street's traditional market-making inefficiencies, which is why it is widely disliked, but it is not "big finance." Big Finance is generally opposed to HFT.

You still haven't provided evidence or numbers to prove or even quantify what you're claiming. Are you saying HFT is not worth the investment to firms, or are you saying it isn't providing some vague "value to society" relative to alternative uses of investor capital?

The first case is obviously nonsensical, as many firms generate profit using high frequency trading strategies. The second case is like saying we shouldn't do anything if it doesn't save impoverished children in Africa. The added liquidity has a material and beneficial impact on trading outcomes for buy-and-hold retail investors, which is shown in my first citation here. You have yet to satisfactorily refute this.

[1]: http://faculty.haas.berkeley.edu/hender/Algo.pdf

[2]: http://www.bloomberg.com/news/articles/2013-06-06/how-the-ro...


Generating a profit does not equate to generating wealth.


How about just making a transaction because you feel that you gain something from it? Does it need any other motivation? I mean, I buy food, go to a concert, rent a car, and buy a stock, not for the purpose of gifting something to society, but for my own sake.


This is the opposite - you don't buy and then resell food in a few seconds for gain.


Former professional investment manager here...

The biggest problem with things like this, which almost nobody talks about in the context of investing, is publication bias.

100 people try to develop a profitable trading algorithm. 1 comes up with one that looks great on back-tests at a 1% confidence (in other words, exactly what you'd expect from random chance alone over 100 trials).

That person writes an article/pitch/business plan based on their algorithm. You never see results from the 99 who failed.

Going forward, the successful algorithm is no more likely to work than the failed 99, but from the perspective of the general public it sure looks like a winner!
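
A quick simulation of that selection effect, assuming 100 strategies with no edge at all (daily returns drawn from a zero-mean Gaussian) and one year of backtest followed by one year of "live" trading. All parameters are arbitrary choices for illustration.

```python
import random
import statistics


def simulate(n_strategies=100, n_days=250, daily_vol=0.01, seed=0):
    """Return (best backtest mean, that strategy's live mean, all results)."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_strategies):
        backtest = [rng.gauss(0.0, daily_vol) for _ in range(n_days)]
        live = [rng.gauss(0.0, daily_vol) for _ in range(n_days)]
        results.append((statistics.mean(backtest), statistics.mean(live)))
    # Pick the strategy whose backtest looked best, as a publisher would
    best_backtest, its_live = max(results)
    return best_backtest, its_live, results


best_backtest, its_live, results = simulate()
print("best backtest mean daily return:", best_backtest)
print("same strategy, live period:     ", its_live)
print("average live mean, all 100:     ", statistics.mean(live for _, live in results))
```

The winner's backtest is positive by construction, because it is the maximum of 100 noisy draws. Its live return comes from the same zero-edge distribution as everyone else's.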


There's an old con game - you send 500 letters to gamblers, predicting the next Dodgers game. 250 predict they'll win; 250 say lose. Game happens, 250 people think Hey lucky guess. To those you send 250 letters, 125 predict they'll win the next game; 125 lose. After 6 games you have 8 people who have seen you guess right 6 times in a row. Get them to pay you for another (worthless) letter.
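
The halving can be checked in a few lines. The sketch assumes that when the remaining count is odd, the extra letter lands on either the winning or the losing side, which gives a best and a worst case.

```python
def recipients_after(n_letters, n_games, lucky=True):
    """Number of people who have only ever seen correct predictions, game by game."""
    remaining = n_letters
    history = [remaining]
    for _ in range(n_games):
        # Half the letters predicted each outcome; only the correct half stays
        remaining = (remaining + 1) // 2 if lucky else remaining // 2
        history.append(remaining)
    return history


print(recipients_after(500, 6, lucky=True))
print(recipients_after(500, 6, lucky=False))
```

Starting from 500 letters, six games leave 7 or 8 people, matching the figure above (500 / 2^6 is about 7.8).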


So much this. Also even if you have won it means very little going forward. If you put 100 guys in a room and asked them to try to flip N consecutive tails one guy will come out thinking he is the king of flipping, with a rock-solid "system". He's just someone who doesn't understand probability. And as you say you don't hear from the other 99 including the math guy who flipped N/2 heads and is muttering about it.


> 100 people try to develop a profitable trading algorithm....

It's much worse than this with machine learning approaches. Imagine a million people trying to find a profitable algo, all on your laptop, and you are choosing the best one out of all of those.

If you are used to pen-and-paper trading strategies, or even excel spreadsheets, machine learning is just a completely different level to this. And probably how it works will be unintelligible to anyone. I don't even see how someone can write a business plan based on this.


The type of approach used has limited effect on survivorship bias; what matters is the number of people employing different approaches and the size of the effect. So if machine learning approaches can produce real results, the data will show this. Survivorship bias is real, but it is not the full story.


If you can actually reliably generate alpha from a model like this there is no point in running the strategy yourself. There are any number of hedge funds that will sign you on, let you keep all of the IP you develop, and give you 10-12% of any returns you generate. That sounds small, but it's mitigated by the fact that you will have access to potentially billions of dollars in capital to trade if your strategy has the capacity for it. So you get 10% of a much bigger pie, with way less downside risk. Plus you get access to all their internal trading systems, execution services, data feeds, etc, which are usually orders of magnitude better than what an individual has access to.


Who do I contact? I have a deep learning startup that is trading forex right now, I would like to make some contacts and see if I can integrate


Create a public, real account on myfxbook.com and let them find you. If you contact them directly, you can link them to your myfxbook account to give them an instant view into how your algorithm has performed.


Why FX? More direct access to exchanges?


it's the most liquid market on earth, it's 24 hour for the majors, its depth is enormous, and it is less prone to event risk than individual equities (as long as you stick to unpegged currencies) so you can get tons of leverage on it.


I think financial prediction via machine learning will be a useful crucible for distinguishing AI from non-AI. So far, so many companies that have applied machine learning to prediction have ended up on the wrong side of the order book at the wrong time. I don't know if this is because other algorithms figure out what they are doing and rapidly develop a counter algorithm to fleece them, or if it's just savvy traders' intuition about what the algorithm is keying on and manipulating it. Sort of like good RTS game players that figure out how the opponent AI is playing and start playing against its programming rather than some strategy from first principles.


Anyone know where he got all the raw data to feed his algo? Clearly he used a lot of data, and the two main sources of free info I know of are Google Finance and Yahoo Finance. At least with Google Finance, I run into issues with their API if I execute too many calls simultaneously: a bunch end up not returning any data.


Not sure where he got his data, but you might want to try https://www.quandl.com/

They have a free, community, curated data set of ~3200 stocks.


Wow i have not heard of that site before - thanks!


Agreed with Quandl being a good source for financial data. Their APIs are also quite well-maintained. Within Quandl, I've found Zacks to be a good resource.

I work with both Quandl and Zacks pretty frequently. Let me know if you're interested in buying large amounts of data from Zacks, and I can perhaps get you a discount from the listed prices on the Quandl website.


If I was interested in such a discount, how would I contact you? (Also, how much is a "large amount"?)


In my experience, Yahoo finance data is not reliable. In one case, I noticed that the stock price is incorrectly adjusted for dividends for all shares trading on a particular exchange. Free correct data is hard to obtain.


You can get data from Interactive Brokers; I assume most of the other brokers that provide an API will give you data too.


You can get constant updates from IG via scraping. Market prices are fractal, i.e. self-similar at any scale.


Interesting article. I do something related, and here's my take:

Data mining is useful because it gives you things that are predictive that you might not have considered at first, but make sense after. This is mainly due to combinatorial explosion in the potential number of formulas.

You generally have a vague idea of what might be predictive, eg cheapness vs earnings and cash flow, but there's a huge number of ways that might show up in the data, and there's a huge number of ways it might hide in the data.

So for instance an old school analyst might do a ranking of price/earnings as well as cash flow, or whatever bespoke formula desired.

A data mining approach could take all the fundamentals and generate formulas mixing the variables, yielding a number that seem to be effective. Out of those, you'd look at them and decide that they capture some thesis (low P/E, upward trend in earnings). Then you'd look at whether the formula is sensitive to small tweaks. For instance, if you regressed the last 6 earnings and it had phenomenal performance, but with 5 or 7 it wasn't, you probably conclude it's some sort of random result.
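
That tweak test can be written down as a simple neighbourhood check. The sketch below assumes you already have a positive performance score per parameter value; the scores in the two dictionaries are invented purely to show the two cases and are not results from any real backtest.

```python
def is_robust(score_by_param, best, tolerance=0.5):
    """True if the parameters adjacent to `best` reach at least `tolerance` of its score.

    Assumes scores are positive (e.g. a Sharpe-like ratio) and parameters are integers.
    A parameter with no scored neighbours passes trivially.
    """
    best_score = score_by_param[best]
    neighbours = [score_by_param[p] for p in (best - 1, best + 1) if p in score_by_param]
    return all(s >= tolerance * best_score for s in neighbours)


# Hypothetical scores keyed by "number of past earnings regressed"
spiky = {5: 0.1, 6: 1.8, 7: 0.0}   # phenomenal at 6, nothing at 5 or 7
smooth = {5: 1.1, 6: 1.2, 7: 1.0}  # similar performance across neighbours

print(is_robust(spiky, 6))
print(is_robust(smooth, 6))
```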

There are funds that take the mass approach to an extreme. They have huge databases, with a genetic algorithm that generates expression trees, and a battery of stats (incl backtests) to decide what works. They end up with many thousands of strategies that are a great deal more effective than your standard one-trick pony fund.


very interesting. Can you recommend any resources for someone with a fairly strong stats / programming background but no real substantive finance experience?


Igor Tulchinsky has a fund that does this. He also writes books and papers about how he does it, with everything you need to do it yourself.


There's a hedge fund built by anonymous data scientists - https://numer.ai

You can use ML to make money on encrypted stock data for free. Think Kaggle but the winning models are used to trade.


It looks like the feature set is fixed on numer.ai? If so, everyone's probably developing mega ensembles (this is what Netflix's competition ended up as, with teams merging because their models did better together). Compare that to Quantopian, where you're responsible for feature engineering too (though numer.ai is probably easier to get started with, since model selection is imo the fun part).


Really successful traders spend their time and resources obtaining insider information, not massaging public data. It stands to reason that an ensemble of technical trading methods would regress towards the mean.


Some of it can be publicly available data that others just haven't thought to use. In another comment someone has mentioned the Walmart carpark satellite imagery example, where they could estimate trends in Walmart's sales by counting the number of cars in the carpark over time.

Nik Cubrilovic recently demonstrated how information leakage can be used to trade stocks, eg estimating the growth rate of the Adobe Creative Cloud customer base based on assigned customer ID numbers, before Adobe announced the figures themselves:

http://www.itnews.com.au/news/how-an-aussie-hacker-used-info...


This is exactly true. I had the pleasure of having a very short sit down with one of the world's most successful traders and he said this in not so many words.

It was more subtle than inside information. He implied that he could actually influence the outcome and hurried me on from that point.

Mind blown.


Sounds like Billions is not that far off...


Billions is more documentary than tv show.


Never heard of it.


Please do share more if you have anything.


It seemed like the game is not necessarily rigged, but information is power. Regular traders are playing for table scraps compared to the really huge people who play politics, not markets.

For example, http://priceonomics.com/the-trade-of-the-century-when-george...

What did Soros know about what would happen and what did he actually cause to happen? The story sounds nice on the surface (I made a multi-billion dollar bet on a quote from some guy - lol), but I wouldn't be surprised if there was some back channelling going on.


You probably mean "information that others do not have". That is not the same thing as "insider information".


Using insider information is illegal. If you mean insider information then this is total nonsense. Sure, there are a few that do illegal things (and inevitably get caught since there is so much monitoring going on).


If you read "Confessions of a Street Addict", Jim Cramer pretty much says directly that he used to call his contacts at Goldman to get info on trades. Reading that convinced me that day trading is for suckers without inside information.


It's only illegal if you are an insider or got it from an insider who realized a gain on it. There are a lot of misconceptions here about what insider trading actually is.


Exactly, when I heard drug use was a problem in some communities I too knew that was impossible because selling and using drugs is illegal.

> Sure, there are a few that do illegal things (and inevitably get caught since there is so much monitoring going on).


Your middlebrow dismissal doesn't work here because the parent didn't say it's not a problem or that it doesn't happen. The parent is refuting a point that successful traders exclusively become that way by trading on illegally obtained information.


The point being refuted is a straw man - nobody said anything about 'exclusively'.


No it isn't. The original comment might not have said "all successful traders spend their time...insider information" but the implication is there as it was stated. Given what the grandparent comment replied with, it appears I wasn't the only one who inferred that message.


Just because more than one person infers something doesn't mean it's really there.


I'll invoke the Black Swan (https://en.wikipedia.org/wiki/The_Black_Swan_%28Taleb_book%2...) since it hasn't been done yet in this thread.


...spend their time and resources...


Hello

I'm Kris, the guy who wrote the article that started this thread. Thanks to all who have read my article and taken the time to comment. In the context of my motivation for starting my blog, it means a lot. I'm an engineer who became interested in quantitative finance and machine learning a few years ago. I learned how to code and apply my maths and stats knowledge to finance independently - no formal training whatsoever. This meant that for a long time I was conducting research and developing trading systems in a vacuum; I had no one to bounce ideas off or learn from. So I started writing about what I was doing in the hopes of getting some feedback. So thank you all for providing some. The insights were immensely valuable and I learned a lot.

I thought it would be useful to respond to some of the comments.

mathgenius brought up the extremely valid point that regular k-fold cross validation in a time series context doesn't make sense since the data is autocorrelated, not iid. I no longer use this approach for time series data, instead favoring Rob Hyndman's time series cross validation approach, also known as forward chaining. I believe this approach is the best representation of a real trading environment. The issue becomes deciding how large the rolling window of training data should be - older data may be obsolete, but excluding too much history can lead to not enough training instances.
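
For readers who haven't seen forward chaining, a minimal rolling-window sketch follows. It is a generic illustration, not the exact procedure from Hyndman's work or from my own code; `train_size` and `test_size` are the window lengths you would have to choose.

```python
def walk_forward_splits(n_samples, train_size, test_size):
    """Yield (train_indices, test_indices); each test block comes strictly after its training block."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += test_size  # roll the window forward by one test block


for train, test in walk_forward_splits(n_samples=10, train_size=4, test_size=2):
    print("train", train, "-> test", test)
```

The model is only ever evaluated on data that comes after everything it was trained on. scikit-learn's `TimeSeriesSplit` provides a similar scheme, with `max_train_size` capping the training window.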

dpweb raises a good point too, namely that just because your model performed well on past data, even if that data was out of sample, there is no guarantee that the future will be sufficiently like the past, meaning that your model may well become useless at some point in time (possibly very quickly). This is a valid point, but no reason to abandon the markets. It does however require that any algorithm's live performance be objectively monitored such that the level of deviation from expected performance can be statistically quantified. Once a pre-determined confidence level in the model's obsolescence is reached, it should be removed from the portfolio.
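
One simple way to quantify that deviation, sketched under strong assumptions (independent per-trade returns, a normal approximation, and an expected mean and standard deviation taken from out-of-sample testing), is a one-sided z-test on the live mean. The names and the 1% threshold are illustrative choices, not a description of my actual monitoring.

```python
import math
import statistics


def z_score_vs_expected(live_returns, expected_mean, expected_std):
    """Standardised gap between the live mean return and the expected mean."""
    n = len(live_returns)
    standard_error = expected_std / math.sqrt(n)
    return (statistics.mean(live_returns) - expected_mean) / standard_error


def should_retire(live_returns, expected_mean, expected_std, z_threshold=-2.33):
    """Flag the model once live results fall below expectation at roughly the 1% level."""
    return z_score_vs_expected(live_returns, expected_mean, expected_std) < z_threshold


# Hypothetical example: 100 live trades that made nothing, versus an expected 0.1% per trade
flat_trades = [0.0] * 100
print(z_score_vs_expected(flat_trades, expected_mean=0.001, expected_std=0.002))
print(should_retire(flat_trades, expected_mean=0.001, expected_std=0.002))
```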

mcbrown's comment about publication bias is a good one too. Even worse, I've personally developed hundreds of trading systems that I haven't published. Other bloggers and publishers have most likely also done the same. This form of selection bias is very likely rampant, and is especially applicable to models 'discovered' using machine learning techniques that may not be rooted in traditional economic or financial principles. The moral: absent some form of robust accounting for selection bias, view all of these types of systems with a healthy dose of skepticism, and the published performance as a theoretical upper limit to what could be achieved in practice.

hendzen's point about partnering with a fund or proprietary trading company rather than running your reliable, alpha generating strategy yourself is also a valid one. I have happily found this out for myself recently.

Also, lordnacho is spot on regarding his take on the utility of data mining in finance.

Thanks again for all the comments!


I've never understood why anyone would spend time creating a trading method, given that even if it did work (possible, but unlikely), the SEC would audit you and then leak how you were generating the outperformance.

I welcome any thoughts, in part because legally beating the market is possible; I just don't get the SEC and OPSEC aspect.


Why would the SEC audit you? Just by random chance there are many people outperforming the market. They can't audit you just for outperforming.

If they do audit you, how will they discover how you are generating your trading decisions? Their remit is to make sure you aren't doing something illegal. There's no reason they would understand what you were doing in anything other than a superficial way.

Also, something can be profitable, and obviously so, without being easily reproducible. For instance there are firms that do simple footrace arbitrage on the same security between different exchanges. Not hard to understand, but you still can't do it. There's a whole spectrum of strategies that are on a frontier on the map of easy-to-understand vs easy-to-implement.

Besides all that, I think even if you were to learn about a way to beat the market, the way you found out might lead you to be very skeptical of whatever was proposed. If a guy is selling it on a website, you will probably not believe him, right? And if he showed you backtests that worked, you would suspect they were generated from a random generator of some sort. And if he then shows you the math, you would almost certainly find fault with it. Why did he do this or that transformation on the data? Must be random...


A few years back, the SEC started being very aggressive about finding entities making above-average returns; my understanding is that if you're making over 30% over a set number of transactions, you will get "knocked", and the auditors have zero reason not to leak the information. The best example I know is the Walmart parking lot satellite imagery analysis; happy to dig up a link.


That sounds like weeding out insider trading, not finding people with legitimate market beating strategies.

If you're trading on confidential information, your profile will look very interesting indeed. You'll be trading near announcements, and you'll be right all the time. Your turnover vs profit and number of trades will be through the roof. By contrast quant shops with real models will be using the law of large numbers.


I'd love a link as well as additional examples, if you can think of any.


Googling "walmart parking lot analysis" yielded the following as the first result.

http://www.cnbc.com/id/38722872


Thanks. It has no information about the technique being leaked by the SEC, though.


Not sure about the SEC leaking anything, but the technique itself is simple: investors use satellite imagery and image processing to see things like how many cars are parked at several big branches, make assumptions about how that correlates with spending, and then act on the information before Walmart releases a quarterly statement, sort of.

The number of container ships docked at or leaving ports around China can forecast trends in China's exports, again before any official numbers are made available. I don't see this as insider trading. You pay for the satellite time, you gamble on your data analysis, and you either win or lose. If it were certain, others would do the same, and the edge would be lost quickly to market adjustments.


The problem isn't coming up with an algorithm that works (i.e. more wins than losses).

The difficulty is gaining confidence in your algo and determining when to move from paper trading to actual trading.

You run into counter-intuitive things while training a neural net, for example. You'd think more training data would be good, but when training neural nets, you actually want to use as little data as possible while still creating an ideal ROC curve.


> The problem isn't coming up with an algorithm that works (i.e. more wins than losses).

An algorithm that works would also include the ability to limit losses. An algorithm might be correct 9 out of 10 times, but may lose more in a single transaction than what it earned in those 9 winning transactions.
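
In numbers, this is the difference between win rate and expectancy. A small sketch with made-up trade sizes, where the average loss is ten times the average win:

```python
def expectancy(win_rate, avg_win, avg_loss):
    """Expected profit per trade, with avg_win and avg_loss given as positive amounts."""
    return win_rate * avg_win - (1 - win_rate) * avg_loss


print(expectancy(0.9, avg_win=1.0, avg_loss=10.0))  # right 9 times out of 10
print(expectancy(0.4, avg_win=3.0, avg_loss=1.0))   # right only 4 times out of 10
```

With these illustrative sizes, the 90% win rate loses about 0.1 per trade while the 40% win rate makes about 0.6, which is why limiting losses has to be part of what "works" means.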



