The phrase Big Data gets bandied about a lot these days. A rough rubric is whether you can query your data on a single computer in a reasonable amount of time. Even that is vague, though: Excel can’t open a file with more than a million or so records, but several in-memory databases wouldn’t balk at the prospect.
And thanks to Moore’s law, processing capability has kept pace with our ability to collect mounds of data for decades. So even back in 1974 (and earlier), there was active research into single-pass (or online) algorithms for common summary statistics.
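The classic example of such a single-pass algorithm is Welford's method for computing mean and variance in one pass, without storing the data. Here's a minimal sketch (the class name and method names are my own, not from any library):

```python
class RunningStats:
    """Single-pass (online) mean and variance, Welford's method."""

    def __init__(self):
        self.n = 0       # observations seen so far
        self.mean = 0.0  # running mean
        self.m2 = 0.0    # running sum of squared deviations from the mean

    def push(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance; zero until we have at least two observations."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```

The point is that each observation is touched exactly once, so the same code works whether the stream is five numbers or five billion.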
The main concern these days is having both accurate rolling summaries—last month’s totals, hourly aggregates, and so on—and near-real-time updates of current activity (e.g. trending tweets).
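One common way to get hourly aggregates that also stay current is to bucket incoming events by hour and evict buckets that fall out of the reporting window. A toy sketch (class and method names are hypothetical, and real systems would handle clock skew and persistence):

```python
class HourlyCounter:
    """Per-hour event counts over a sliding window of recent hours."""

    def __init__(self, hours=24):
        self.hours = hours
        self.buckets = {}  # hour index (epoch seconds // 3600) -> count

    def record(self, ts):
        """Count one event with Unix timestamp ts, evicting stale buckets."""
        hour = int(ts // 3600)
        self.buckets[hour] = self.buckets.get(hour, 0) + 1
        cutoff = hour - self.hours
        for h in [h for h in self.buckets if h <= cutoff]:
            del self.buckets[h]

    def totals(self):
        """Snapshot of counts for the hours still inside the window."""
        return dict(self.buckets)
```

Querying `totals()` at any moment gives an up-to-the-event view of the last N hours without re-scanning history.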
A standard approach that has gained ground in the past few years is the lambda architecture. In this design, incoming data is broadcast to two parallel pipelines: a batch layer and a stream-processing layer. The batch layer provides historical and last-N aggregate reporting, while the stream layer updates current snapshots on the fly. The ‘lambda’ in this approach is a reference to the programming paradigm of anonymous functions created just-in-time, the idea being that they are applied as needed to the data streams.
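At query time, the two pipelines are reconciled by merging the batch view with the real-time deltas. A minimal sketch of that merge, with made-up keys and counts standing in for whatever stores a real deployment would use:

```python
# Batch layer: precomputed totals as of the last batch run.
batch_view = {"page_a": 10_000, "page_b": 7_500}

# Speed (streaming) layer: counts of events that arrived after the batch cutoff.
speed_view = {"page_a": 42, "page_c": 3}

def query(key):
    """Serving layer: merge the batch and real-time views at read time."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)
```

When the next batch run completes, its output replaces `batch_view` and the corresponding entries in `speed_view` are dropped, so the merge stays cheap and the batch layer remains the ground truth.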
Don’t worry, though: there are plenty of details about collecting, cleaning, and aggregating data that are tough to get right no matter the scale. You don’t have to have big data to have a data problem.