A couple of pointers. One on data, one on that RAM usage.
First I’ll go with the RAM usage because this is hackernews and everyone loves algorithms.
---
There are a lot of libraries out there that do technical analysis, and most of them are designed for batch processing. TA-Lib is an example - it works on large data sets but is not appropriate for live trading because it repeats calculations over and over and over again. If you have 10000 datapoints and calculate indicators, it’ll calculate 10000 of them. Add one more bar, now you have a dataset of 10001 items, which TA-Lib will calculate the indicators on from scratch. Or maybe you drop the oldest bar and just feed the last 10000, but you still perform that calculation over all of those. Either way, it’s bad. Same goes for every library I’ve seen, presumably because no one would open source a production grade indicator generation system.
The production approach to this is somewhat different. Turn your features into state engines. Most features are just running calculations that are very easy to perform once per bar.
Moving averages are a perfect example - for a moving average of N bars, store N items in a fixed size array on the stack and keep track of the latest index to be written to. When a new value (X) comes in, increment that index (wrapping back to 0 when it reaches N), and grab the value (Y) stored there, which is the oldest one in the window. Modify your old mean by adding (X-Y)/N. Then write X over the old value Y (it’s a ring buffer).
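A minimal Python sketch of that ring buffer, just to make the steps concrete. The class name `RollingMean` and the choice to return `None` until the window has filled are my own assumptions, not something from a particular library.

```python
class RollingMean:
    """Fixed-memory moving average over the last n values (a sketch)."""

    def __init__(self, n):
        self.n = n
        self.buf = [0.0] * n  # ring buffer, zero-filled so warm-up needs no special case
        self.idx = -1         # index of the most recent write
        self.count = 0        # values seen so far, capped at n
        self.mean = 0.0

    def update(self, x):
        self.idx = (self.idx + 1) % self.n   # advance and wrap around
        y = self.buf[self.idx]               # oldest value, about to be overwritten
        self.mean += (x - y) / self.n        # adjust the old mean by (X - Y) / N
        self.buf[self.idx] = x
        self.count = min(self.count + 1, self.n)
        # Assumption: report nothing until a full window has been seen.
        return self.mean if self.count == self.n else None


sma = RollingMean(3)
values = [sma.update(v) for v in [1.0, 2.0, 3.0, 4.0, 5.0]]
```

Each update is O(1) regardless of N. One caveat: the incremental update accumulates floating point rounding over a very long run, so recomputing the mean from the buffer every so often is a cheap safeguard.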
For EMAs, it’s even easier, because you just need to keep track of a numerator and a denominator - nothing else is needed. On a new value X with the scaling factor A (such that A^halflife = 0.5), the numerator N becomes N*A + X and the denominator D becomes D*A + 1 (N here is the numerator, not the window length from the moving average). Divide the two and you’ve got your new value.
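The same idea as a small Python sketch. The class name `DecayingMean` is made up for illustration; the update rule is the numerator/denominator recurrence described above, with the scaling factor derived from the half-life.

```python
class DecayingMean:
    """Exponentially weighted mean kept as a numerator and a denominator."""

    def __init__(self, halflife):
        # Scaling factor a chosen so that a ** halflife == 0.5.
        self.a = 0.5 ** (1.0 / halflife)
        self.num = 0.0
        self.den = 0.0

    def update(self, x):
        self.num = self.num * self.a + x    # decayed sum of values
        self.den = self.den * self.a + 1.0  # decayed sum of weights
        return self.num / self.den


ema = DecayingMean(halflife=1)  # a == 0.5
first = ema.update(1.0)         # 1.0 / 1.0
second = ema.update(3.0)        # (0.5 * 1 + 3) / (0.5 * 1 + 1) = 3.5 / 1.5
```

Because the denominator tracks the total weight seen so far, the first few outputs are properly normalised instead of being biased towards whatever seed value you would otherwise have to pick.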
If an indicator depends on another, don’t recalculate it. Break the indicators down into fundamental calculations and you’ll often find a lot of redundant calculations being done. Rearrange it all so it fits. Automate that process if you enjoy that kind of thing like I do.
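As one illustration of what that breakdown can look like (my own example, not necessarily how the engine described here is organised): a rolling mean, a rolling standard deviation, Bollinger-style bands and a z-score all reduce to a rolling sum and a rolling sum of squares, so those two quantities can be maintained once per bar and shared. The names `RollingStats` and `width` are hypothetical.

```python
from collections import deque
from math import sqrt


class RollingStats:
    """One rolling sum and sum of squares feeding several indicators."""

    def __init__(self, n, width=2.0):
        self.n = n
        self.width = width    # band width in standard deviations (assumed parameter)
        self.window = deque()
        self.s1 = 0.0         # rolling sum of x
        self.s2 = 0.0         # rolling sum of x * x

    def update(self, x):
        self.window.append(x)
        self.s1 += x
        self.s2 += x * x
        if len(self.window) > self.n:
            old = self.window.popleft()
            self.s1 -= old
            self.s2 -= old * old
        if len(self.window) < self.n:
            return None       # not enough bars yet
        mean = self.s1 / self.n
        # Population variance; clamp tiny negatives caused by rounding.
        var = max(self.s2 / self.n - mean * mean, 0.0)
        std = sqrt(var)
        return {
            "mean": mean,
            "std": std,
            "upper": mean + self.width * std,
            "lower": mean - self.width * std,
            "zscore": (x - mean) / std if std > 0 else 0.0,
        }
```

The sum-of-squares form is the simplest to read but loses precision when prices are large relative to their spread, so treat it as a sketch of the sharing idea and not as a numerically careful implementation.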
Most indicators can be handled this way. The ones that can’t are rarely useful. After all, what you’re tracking is the evolving state of the market, and if you’re doing gymnastics over an indefinite number of bars, it probably doesn’t mean much.
The end result is that you don’t end up accumulating memory throughout the day. You receive a bar, you throw it through your indicator generators (all of which use a fixed amount of memory), then you discard the bar and wait for the next. Save state at the end of the day and load that on the following market day.
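Saving and restoring state can be as simple as dumping each engine's fields to disk. A minimal sketch, assuming the engines are plain dataclasses and JSON is an acceptable format; `DecayingMean`, `save_state` and `load_state` are hypothetical names used only for this example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DecayingMean:
    a: float          # per-bar scaling factor
    num: float = 0.0
    den: float = 0.0

    def update(self, x):
        self.num = self.num * self.a + x
        self.den = self.den * self.a + 1.0
        return self.num / self.den


def save_state(engines, path):
    """Write every engine's fields to one JSON file at the end of the day."""
    with open(path, "w") as f:
        json.dump({name: asdict(e) for name, e in engines.items()}, f)


def load_state(path):
    """Rebuild the engines from the saved fields before the next open."""
    with open(path) as f:
        return {name: DecayingMean(**fields) for name, fields in json.load(f).items()}
```

The file size depends only on the number of engines and their window lengths, not on how many bars have gone through them.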
The result will be a speed up like you couldn’t imagine. I promise. I run an indicator generation engine in a container capped at 40MB of RAM on one CPU, and it generates hundreds of indicators in much less than a millisecond after the bar arrives.
---
Now, onto data. I recommend cleaning your data. You have a screenshot of a Tesla chart in there with some funky highs/lows every so often. It has been a long time since I’ve worked with US equities (and gladly so, it’s a mess of a system!) but the following is the best of my recollection.
The trades you receive will come from several sources. For US stocks, there are several different venues that operate their own order books. These will operate as typical markets between the open and close of the day. By typical I mean that the bid and ask represent what you’d get if you market order instantly (which you can’t), and the market trades on them have to take from the bid and ask side of the order book (formed by people placing limit orders).
However. There’s also the ADF - the Alternative Display Facility. This is the DIY of trade reporting (and quote reporting, but no one uses it for that). If someone sells some shares to their grandmother for a low price in exchange for the recipe to her famous apple pie, the ADF is where they can tell other participants about that trade, manually, subject to fat finger errors, prices of weird fractions of cents, and very relaxed constraints on timing.
It’s also where dark pools post trades.
The problem with this is that this data has no direct relationship to the rest of the market. This is why, every so often, you’ll see those blips.
If you don’t clean them, they’ll play havoc with any indicator that uses highs and lows.
The other problem is that ADF trades - at least when I last analysed this very issue - are not rare. They make up a large fraction of trades. So my approach was to clean ADF trades more rigorously than the venues by matching them against prior prices from an ADF-free background.
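A rough sketch of that kind of filter, under loud assumptions: each trade arrives as a `(price, size, off_exchange)` tuple in time order, the `off_exchange` flag is something you derive yourself from your feed's venue or reporting codes, and the reference price is the median of the last few lit-venue prints. The window and tolerance values are placeholders, not recommendations.

```python
from collections import deque
from statistics import median


def clean_trades(trades, window=20, max_dev=0.005):
    """Yield trades, dropping off-exchange prints far from recent lit prices.

    trades: iterable of (price, size, off_exchange) tuples in time order.
    max_dev: allowed relative deviation from the lit-venue reference price.
    """
    lit = deque(maxlen=window)  # recent lit-venue prices only (the ADF-free background)
    for price, size, off_exchange in trades:
        if not off_exchange:
            lit.append(price)
            yield price, size, off_exchange
        elif lit:
            ref = median(lit)
            if abs(price - ref) <= max_dev * ref:
                yield price, size, off_exchange
        # Off-exchange prints seen before any lit trade are dropped.


trades = [
    (100.00, 100, False),
    (100.10, 50, False),
    (95.00, 10, True),    # far from the lit reference, dropped
    (100.05, 200, True),  # consistent with the lit reference, kept
    (100.20, 100, False),
]
cleaned = list(clean_trades(trades))
```

In this sketch lit-venue trades pass through untouched; in practice they would get their own, looser checks. The important part is that off-exchange prints never feed the reference window they are judged against.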
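Are you using number of trades or sum of trade sizes?
Because if it’s the former, bear in mind that there is little real distinction between me firing off two market buys of 100 shares each in quick succession vs me firing off one market buy of 200 shares, so your sampling shouldn’t be impacted by the difference. A trade count sees two events in the first case and one in the second; summing trade sizes sees 200 shares either way.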
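To make the difference concrete, here is a small Python sketch that only counts how many bars each sampling scheme closes for the same traded volume. The function names and thresholds are invented for the example.

```python
def count_tick_bars(sizes, trades_per_bar):
    """Bars closed when sampling every fixed number of trades."""
    return len(sizes) // trades_per_bar


def count_volume_bars(sizes, shares_per_bar):
    """Bars closed when sampling every fixed number of shares traded."""
    bars, acc = 0, 0
    for size in sizes:
        acc += size
        while acc >= shares_per_bar:  # one large trade can close several bars
            bars += 1
            acc -= shares_per_bar
    return bars


split = [100, 100] * 10  # 20 trades of 100 shares
merged = [200] * 10      # the same 2000 shares as 10 trades of 200

tick_split = count_tick_bars(split, 5)
tick_merged = count_tick_bars(merged, 5)
vol_split = count_volume_bars(split, 500)
vol_merged = count_volume_bars(merged, 500)
```

With these inputs the trade-count scheme closes twice as many bars for the split orders as for the merged ones, while the share-sum scheme closes the same number in both cases.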
I’m not altogether convinced by volume sampling. It’s an idea popularised by López de Prado, but I’ve never seen it actually work in practice. It makes you trade more when the market is going through turmoil, it makes you trade more over time (as volume per day generally increases), and I haven’t seen any evidence of the importance of information content.
If you’re trading based on patterns in the market, it’s easy to lead yourself to believe that you’re predicting the market. That isn’t the case, though - the market is formed of many, many independent people making guesses.
The thing you’re actually doing is predicting what other people are going to predict. If there’s an established pattern, like some moving average crossing another or a wedge or anything else like that, the reason it tends to complete is not mystical - it completes because other people see the pattern, think it’s going to go up (or down), buy (or sell), and then that has the effect of pushing the market in that direction (it also means that anyone late to the game can’t benefit from the movement).
As such, the strongest strategy when trying to use technical analysis to determine possible market moves is to use the resolution that everyone else is using. This is overwhelmingly time-based. There are people trading in the 1s regime, the 1m regime, the 15m regime, etc, and they’ll often stick to that and execute trades with a proportional rollout time and aim for a proportional profit.
If you pick just a random number of trades or amount of volume that suits you, you can easily find that you are out of sync, competing against no one in particular, and you’ll see that it’s impossible to find a pattern.
Many other people use price levels. So they’ll have their limits and stops at round numbers, or at percentage changes on the day, week, month, etc. So there is an argument for price bars too.
Do you mind sending me an email and maybe we could chat?
Thank you for your amazing comments. I really like your idea about indicators and saving state. I'll give that a try! Yeah, Marcos López de Prado is actually where I read about the tick bars and sampling at higher rates. You need like 2 PhDs to read his books though, haha. I am doing this based on X number of ticks and not even looking at volume. I was using tick count as an indicator, in that you can really see patterns when activity picks up. This sync issue is really, really interesting and I'll explore it.
"The thing you’re actually doing is predicting what other people are going to predict." I think I've actually seen this in the data, in that you can almost see these mini-cycles when you really zoom into a fast moving stock. I'll check it out. Both your comments are amazing and it 100% shows you know what you're talking about.