Originally Posted By: matthew_k
Have you considered whether you can work with SQLite? It's the de facto standard in "SQL on files", but it doesn't sound like it's a drop-in solution.


I could be wrong, but I think our requirement that it be able to partition data across multiple files on disk rules out SQLite, which doesn't have any partitioning support that I'm aware of.

Quote:
Can you be more specific about the data? Are the records index/value mappings? Are the indexes always the same? What kind of indexing are you currently using?


It's network flow data, much like Cisco Netflow. So, a 5-tuple of (src ip, src port, dest ip, dest port, protocol) plus start time, end time, and a tiny bit of extra data like TCP flag mask, next hop IP, etc. Each file contains an hour's worth of data from one sensor.
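
To make the record shape concrete, here is a rough sketch of one flow as a Python dataclass. The field names, the types, and the FlowRecord class itself are assumptions for illustration only; the real on-disk format isn't described here and may well differ.

```python
from dataclasses import dataclass
from datetime import datetime
from ipaddress import IPv4Address


@dataclass(frozen=True)
class FlowRecord:
    """One flow record (hypothetical layout; names and types are assumed)."""
    src_ip: IPv4Address
    src_port: int
    dst_ip: IPv4Address
    dst_port: int
    protocol: int            # IP protocol number, e.g. 6 for TCP
    start_time: datetime
    end_time: datetime
    tcp_flags: int           # TCP flag mask
    next_hop_ip: IPv4Address

    def five_tuple(self):
        """Return the (src ip, src port, dest ip, dest port, protocol) key."""
        return (self.src_ip, self.src_port,
                self.dst_ip, self.dst_port, self.protocol)


# Example record with made-up values.
example = FlowRecord(
    src_ip=IPv4Address("10.0.0.1"), src_port=51234,
    dst_ip=IPv4Address("192.0.2.7"), dst_port=80, protocol=6,
    start_time=datetime(2010, 3, 5, 14, 0, 0),
    end_time=datetime(2010, 3, 5, 14, 0, 9),
    tcp_flags=0x1B, next_hop_ip=IPv4Address("10.0.0.254"),
)
```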

There's no indexing to speak of except for the directory and file structure, which is (generally):

/class/type/year/month/day/sensor-hour

"class" is usually some segment of the traffic based on where the sensors are located (at the network core, on the border, etc.) and "type" is usually used to indicate the direction of the traffic (in/out of the network.)

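As a sketch of how that layout maps to a path, something like the following would build the file name for one sensor-hour. The function name, the root argument, the zero-padded date parts, and the "sensor-hour" file name format are all assumptions; only the directory order comes from the layout above.

```python
from datetime import datetime
from pathlib import PurePosixPath


def flow_file_path(root, flow_class, flow_type, sensor, hour_start):
    """Build /class/type/year/month/day/sensor-hour under root.

    Zero-padded month/day/hour and the "<sensor>-<hour>" file name
    are assumed for illustration.
    """
    return (PurePosixPath(root) / flow_class / flow_type
            / f"{hour_start:%Y}" / f"{hour_start:%m}" / f"{hour_start:%d}"
            / f"{sensor}-{hour_start:%H}")


# Hypothetical class "core", type "in", sensor "s1".
path = flow_file_path("/data", "core", "in", "s1", datetime(2010, 3, 5, 14))
```

Looking up one sensor's hour is just a path computation, which is why that kind of query is cheap.
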
So, it's trivially easy to get a particular sensor's data, or all data for a particular hour/day, but getting all flows from a particular IP over a wide time range is a brute force chug through the files.
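
To show why the IP query is the expensive one, here is a minimal sketch that lists every file a query over a time range would have to open when nothing narrows it down by IP. The candidate_files function and the class, type, and sensor names are made up for the example, and the path format is the same assumed one as the layout above.

```python
from datetime import datetime, timedelta
from itertools import product


def candidate_files(classes, types, sensors, start, end):
    """Yield every hourly file path covering [start, end).

    With no index on IP, a "flows from this IP" query must read all
    of them: hours x classes x types x sensors.
    """
    hour = start.replace(minute=0, second=0, microsecond=0)
    while hour < end:
        for c, t, s in product(classes, types, sensors):
            yield f"/{c}/{t}/{hour:%Y/%m/%d}/{s}-{hour:%H}"
        hour += timedelta(hours=1)


# One week, 2 classes, 2 types, 3 sensors (all hypothetical).
files = list(candidate_files(["core", "border"], ["in", "out"],
                             ["s1", "s2", "s3"],
                             datetime(2010, 1, 1), datetime(2010, 1, 8)))
```

Even in this small made-up setup a one-week search touches 168 hours times 12 class/type/sensor combinations, and every one of those files has to be read end to end.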
_________________________
- Tony C
my empeg stuff