Working with Records

Most non-trivial uses of LogQS will involve interacting with records which have been ingested. On this page, we’ll cover a few different aspects of working with records.

Record Auxiliary Data

First, we’ll load a log, a topic from that log which is of type sensor_msgs/Image, and a single record from that topic. We’ll dump the contents of the record to see what it looks like:

log = lqs.list.log(name="Demo Log")[0]
topic = lqs.list.topic(log_id=log.id, type_name="sensor_msgs/Image")[0]
record = lqs.list.record(log_id=log.id, topic_id=topic.id, limit=1)[0]

print(record.model_dump())

>> {'locked': False,
 'locked_by': None,
 'locked_at': None,
 'lock_token': None,
 'timestamp': 1655235727034130944,
 'created_at': datetime.datetime(2023, 12, 18, 22, 25, 10, 453947, tzinfo=TzInfo(UTC)),
 'updated_at': None,
 'deleted_at': None,
 'created_by': None,
 'updated_by': None,
 'deleted_by': None,
 'log_id': UUID('f94c2773-6075-44d3-9638-89489e99d0c0'),
 'topic_id': UUID('0f552dad-30b5-4d93-b6a2-67403527fa3a'),
 'ingestion_id': UUID('707e51ae-25a7-42ff-8ed5-9d8ed603b883'),
 'data_offset': 18122,
 'data_length': 1710802,
 'chunk_compression': None,
 'chunk_offset': None,
 'chunk_length': None,
 'source': None,
 'error': None,
 'query_data': None,
 'auxiliary_data': None,
 'raw_data': None,
 'context': None,
 'note': None}
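A note on the timestamp field: it’s an integer count of nanoseconds since the Unix epoch (you can see the seconds portion match header.stamp.secs once we deserialize this record later on this page). A quick conversion sketch using Python’s standard library:

```python
from datetime import datetime, timezone

timestamp_ns = 1655235727034130944  # record timestamp in nanoseconds since the Unix epoch
dt = datetime.fromtimestamp(timestamp_ns / 1e9, tz=timezone.utc)
print(dt.isoformat())  # a UTC datetime in mid-June 2022
```

Note that float division loses sub-microsecond precision; use divmod(timestamp_ns, 1_000_000_000) if exact nanoseconds matter.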

In LogQS, records can be associated with “auxiliary” data which allows us to augment records with arbitrary JSON data stored in an object store. This data is not included on records by default, as loading the data incurs a performance hit per record, but it can be loaded by setting the include_auxiliary_data parameter to True when fetching or listing records.

Note: auxiliary data can be arbitrarily large, so loading a large amount of records with auxiliary data can be problematic (including errors related to payload limits). It’s usually best to load records with auxiliary data one at a time, or in small batches.

record = lqs.list.record(
    log_id=log.id,
    topic_id=topic.id,
    limit=1,
    include_auxiliary_data=True
)[0]
print(record.model_dump())

>> {'locked': False,
 'locked_by': None,
 'locked_at': None,
 'lock_token': None,
 'timestamp': 1655235727034130944,
 'created_at': datetime.datetime(2023, 12, 18, 22, 25, 10, 453947, tzinfo=TzInfo(UTC)),
 'updated_at': None,
 'deleted_at': None,
 'created_by': None,
 'updated_by': None,
 'deleted_by': None,
 'log_id': UUID('f94c2773-6075-44d3-9638-89489e99d0c0'),
 'topic_id': UUID('0f552dad-30b5-4d93-b6a2-67403527fa3a'),
 'ingestion_id': UUID('707e51ae-25a7-42ff-8ed5-9d8ed603b883'),
 'data_offset': 18122,
 'data_length': 1710802,
 'chunk_compression': None,
 'chunk_offset': None,
 'chunk_length': None,
 'source': None,
 'error': None,
 'query_data': None,
 'auxiliary_data': {'image': 'UklGR...e/8f0gQAAAA',
  'max_size': 640,
  'quality': 80,
  'format': 'webp'},
 'raw_data': None,
 'context': None,
 'note': None}

You should see that the auxiliary data for this record includes an ‘image’ field with a base64-encoded image. In LogQS, we automatically process certain types of data, such as images, to generate this auxiliary data on-demand. Other types of data may not have auxiliary data generated automatically, in which case a user will need to manually create it.

The LogQS client includes a variety of utils which perform common operations, including a method for loading a valid image from auxiliary data given a record:

lqs.utils.load_auxiliary_data_image(record)
../_images/03_working_with_records_7_0.png

The “thumbnail” associated with the record.

Note that the image you’d find in the auxiliary data of a record is typically downscaled and compressed, making it unsuitable for high-quality image processing. We refer to these images as “preview” images since they’re appropriate for quick reference.
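If you’d rather decode the preview yourself than use the util, the ‘image’ field is plain base64. The ‘UklGR’ prefix seen above decodes to a RIFF header, which is consistent with the ‘webp’ format field. A sketch (using a hypothetical stand-in payload, since the real one is truncated above):

```python
import base64

preview_b64 = "UklGRg=="  # hypothetical stand-in; a real payload is much longer
decoded = base64.b64decode(preview_b64)
print(decoded[:4])  # b'RIFF' -- the start of a WebP (RIFF) container
```

With a real payload, wrapping the decoded bytes in io.BytesIO lets you hand them straight to PIL.Image.open.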

If you need a full-resolution image, you’ll need to fetch and deserialize the original data from the log file.

Fetching Record Data

When we want to fetch the original log data for a record, we have to jump through a few hoops to actually get it. The record provides enough information to locate the original data within the log file in the object store, but doing this manually is quite cumbersome.

To make this process easier, we’ve provided a utility method for fetching the record bytes given a record. Note that this process can be slow, especially when performed on a single record at a time:

record_bytes = lqs.utils.fetch_record_bytes(record)

print(record_bytes[:10])

b'`\n\x00\x00\x8e\xe4\xa8b(p'
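Those first bytes aren’t arbitrary: in the ROS 1 wire format for a sensor_msgs/Image message, the serialized payload begins with header.seq (little-endian uint32) followed by header.stamp.secs (uint32). We can check this against the ten bytes printed above:

```python
import struct

record_bytes = b'`\n\x00\x00\x8e\xe4\xa8b(p'  # the first 10 bytes printed above
# unpack two little-endian uint32s: header.seq, then header.stamp.secs
seq, secs = struct.unpack_from("<II", record_bytes, 0)
print(seq, secs)  # 2656 1655235726
```

These values match the header.seq and header.stamp.secs fields in the deserialized record shown below.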

LogQS comes with deserialization utilities for different log formats. There are different ways of accessing these utilities, but if you’re interested in fetching and deserializing the original log data for a record, the following method is the most straightforward:

record_data = lqs.utils.get_deserialized_record_data(record)

# we omit the "data" field since it's big and not very interesting to see
print({ k: v for k, v in record_data.items() if k != "data" })

>> {'header': {'seq': 2656,
  'stamp': {'secs': 1655235726, 'nsecs': 999977000},
  'frame_id': 'crl_rzr/multisense_front/aux_camera_frame'},
 'height': 594,
 'width': 960,
 'encoding': 'bgr8',
 'is_bigendian': 0,
 'step': 2880}

Our deserialization utilities will return a dictionary with the deserialized data in a format closely matching the original data schema. In the case of sensor_msgs/Image topics, you’ll find that the dictionary looks similar to the ROS message definition.
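These fields are enough to sanity-check the raw image buffer before doing anything with it: for a ‘bgr8’ encoding, step should be the width times three bytes per pixel, and the data field should be height * step bytes long:

```python
# values from the deserialized record above
height, width, step = 594, 960, 2880
bytes_per_pixel = 3  # 'bgr8' is one byte each for blue, green, red

assert step == width * bytes_per_pixel
expected_len = height * step
print(expected_len)  # 1710720 -- the expected size of record_data["data"]
```

(The record’s data_length of 1,710,802 is slightly larger because it also covers the serialized header and metadata fields, not just the pixel data.)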

If we want to view this image, we’ll have to do a little processing to convert the image data to a format that can be displayed in a Jupyter notebook. We’ll use the PIL library to do this:

from PIL import Image as ImagePIL

mode = "RGB" # different encodings may use different modes
img = ImagePIL.frombuffer(
    mode,
    (record_data["width"], record_data["height"]),
    bytes(record_data["data"]),
    "raw",
    mode,
    0,
    1,
)

# in this case, we actually have a BGR image, not an RGB, so we need to swap the channels
b, g, r = img.split()
img = ImagePIL.merge("RGB", (r, g, b))

img
../_images/03_working_with_records_14_0.png

The full-resolution image associated with the record.

Of course, we also offer a utility function which can do this for you:

from lqs.common.utils import get_record_image

img = get_record_image(record_data, format="PNG")
img
../_images/03_working_with_records_16_0.png

The same full-resolution image associated with the record.

Listing Records

If we need to work with more than one record in this way, there are some approaches that can improve performance depending on the context. For example, if we’re interested in a list of records across time but don’t need every record within that span, we can use the frequency parameter to specify how many records we want to fetch per second. This is useful for getting a representative sample of records across time without having to load every single record.

records = lqs.list.record(
    log_id=log.id,
    topic_id=topic.id,
    frequency=0.1 # 0.1 record per second, or 1 record every 10 seconds
).data

print(f"Found {len(records)} records")

>> Found 7 records

If there are many records to work with, we may need to paginate through the results to fetch them all. We can do this manually using the limit and offset parameters, or we can use the list_all() utility method to abstract away this logic for us:

records = lqs.utils.list_all(
    lqs.list.record,
    log_id=log.id,
    topic_id=topic.id,
    frequency=0.1
)

print(f"Found {len(records)} records")

Note, however, that logs can contain a large number of records, so fetching all records at once this way is not recommended unless you are sure that the number of records is manageable.
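For reference, the manual limit/offset pagination that list_all abstracts away looks roughly like this. The sketch below runs against a stand-in list function so it’s self-contained; with LogQS, you’d pass lqs.list.record along with its filter kwargs:

```python
def paginate(list_fn, page_size=100, **kwargs):
    """Repeatedly call list_fn with limit/offset until a short page signals the end."""
    results = []
    offset = 0
    while True:
        page = list_fn(limit=page_size, offset=offset, **kwargs)
        results.extend(page)
        if len(page) < page_size:
            break
        offset += page_size
    return results

# stand-in list function backed by a plain list, for illustration
_items = list(range(250))
def fake_list(limit, offset):
    return _items[offset:offset + limit]

print(len(paginate(fake_list)))  # 250
```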

We can then proceed as we did above to fetch the original log data for each record, but the methods used above aren’t optimized for working with a batch of records (you’ll incur unnecessary overhead for each record). Instead, you’d want to use the iter_record_data utility method, which takes a list of records as input and produces an iterator yielding tuples of each record and its data. This method is optimized for fetching data for multiple records at once, re-using lookup data and the deserialization utilities across records:

for idx, (record, record_data) in enumerate(lqs.utils.iter_record_data(records, deserialize_results=True)):
    image = get_record_image(
        record_data,
        format="PNG",
    )
    image.thumbnail((200, 200)) # make them small for the sake of compactness, but the record_data is full-res
    display(image)
../_images/03_working_with_records_21_0.png
../_images/03_working_with_records_21_1.png
../_images/03_working_with_records_21_2.png
../_images/03_working_with_records_21_3.png
../_images/03_working_with_records_21_4.png
../_images/03_working_with_records_21_5.png
../_images/03_working_with_records_21_6.png

Basic Digestions

Digestions are processes managed by LogQS which provide a way to handle batches of records efficiently. Digestions are composed of four main processing steps:

  • First, the record index information for the provided selection of records (based on topics, time ranges, etc.) is fetched from the database.

  • Next, the underlying message data for each record is fetched and stored in record “blobs” in the object store.

  • Then, the record blobs are processed according to the digestion workflow.

  • Finally, the record blobs are deleted from the object store.

Sometimes you may want to work with batches of records locally: you don’t have a managed workflow ready to process the generated record blobs, but you effectively want to process those blobs yourself. In this case, you can use the lqs.utils.iter_digestion_records method to create an iterator which will yield the record data from these blobs. The benefit of doing this (versus iterating over record data directly, as in the previous section) is that, by copying only the relevant records into the object store as blobs, you can stream the data from the object store more efficiently, without downloading unnecessary record data or incurring the overhead of incongruent data.

As an example, assume we’ve created a digestion which contains a single topic of image data. We want to download these images and store them as individual files locally. When we create the digestion, we specify that it should use the Basic Digestion workflow. This workflow will create the record indexes as well as the record blobs, but will stop at the finalizing state. Once it has reached the finalizing state, you could run something like the following:

import os
from lqs import LogQS
from lqs.transcode import Transcode

lqs = LogQS()
transcoder = Transcode()

digestion_id = "cf893115-e194-4b93-bf24-06fcf7c49437" # our digestion's ID
image_dir = f"digestion_images/{digestion_id}"
os.makedirs(image_dir, exist_ok=True)

topics = {}
record_iter = lqs.utils.iter_digestion_records(digestion_id)
for digestion_part_id, part_entry, record_bytes in record_iter:
    topic_id = str(part_entry.topic_id)
    if topic_id not in topics:
        topics[topic_id] = lqs.fetch.topic(topic_id).data
    topic = topics[topic_id]
    record_data = transcoder.deserialize(
        message_bytes=record_bytes,
        type_name=topic.type_name,
        type_encoding=topic.type_encoding,
        type_data=topic.type_data,
    )
    image_bytes = lqs.utils.get_record_image(record_data, format="PNG", return_bytes=True)
    with open(os.path.join(image_dir, f"{part_entry.timestamp}.png"), "wb") as f:
        # image_bytes is io.BytesIO
        f.write(image_bytes.getbuffer())

The result of running this will be a folder with all the images from the digestion stored as PNG files.

Once you’re finished with this digestion, you should transition it to completed, which will then perform the final cleanup of the digestion. This is important to avoid storing unnecessary data (the record blobs) in the object store, but the digestion _can_ be kept in the finalizing state for as long as needed if the record blobs will be used in the future.

lqs.update.digestion(digestion_id, data={"state": "completed"})

The decision to leverage digestions like this, compared to fetching record data through the iter_record_data method or on a record-by-record basis, depends on a number of context-dependent variables. Digestions incur some overhead to process the indexes and blobs, plus some overhead for setting up the iterator, but once this is ready, downloading data this way is much more efficient than the other approaches. In general: download record data on a record-by-record basis if you only need a few records; use the record iterator if you need many records and selecting them beforehand isn’t feasible; and use the digestion iterator if you need many more records and can afford the overhead of creating the digestion.

Best Practices

There are a lot of ways to work with records, and the best approach is often context dependent, but here are some general best practices to keep in mind when working with records:

Re-Using Lookup Data and Deserialization Utilities

If you’re only working with a few records, it’s usually best to fetch the record data on a record-by-record basis using the fetch_record_bytes or get_deserialized_record_data utility methods. This is the most straightforward way to get the data for a record, and it doesn’t require any setup. However, note that these methods depend on other resources, such as the record’s topic and ingestion; if these aren’t supplied, the methods will have to make additional API calls to fetch the necessary data (which can add up).

So, if you’re using fetch_record_bytes and you already have the given record’s ingestion (referenced by the record’s ingestion_id field) loaded, you can pass it in as an argument to avoid unnecessary API calls, e.g.,

ingestion = lqs.fetch.ingestion(record.ingestion_id).data
record_bytes = lqs.utils.fetch_record_bytes(record, ingestion=ingestion)

It’s important to keep in mind that records from the same log _can_ come from different ingestions, so you can’t assume that all records from a given log will share the same ingestion (however, this is often the case, especially for smaller logs).
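When records do span multiple ingestions, a small cache keyed by ingestion_id keeps the API calls down to one per distinct ingestion. A minimal sketch, using a stand-in fetch function so it runs standalone (with LogQS, the fetch would be something like lambda id: lqs.fetch.ingestion(id).data):

```python
fetch_count = 0

def fake_fetch_ingestion(ingestion_id):
    """Stand-in for an API call; counts invocations so we can see the cache working."""
    global fetch_count
    fetch_count += 1
    return {"id": ingestion_id}

ingestion_cache = {}

def get_ingestion(ingestion_id, fetch_fn=fake_fetch_ingestion):
    # fetch each distinct ingestion only once, then serve it from the cache
    if ingestion_id not in ingestion_cache:
        ingestion_cache[ingestion_id] = fetch_fn(ingestion_id)
    return ingestion_cache[ingestion_id]

# three records, two distinct ingestions -> only two fetches
for ingestion_id in ["ing-a", "ing-b", "ing-a"]:
    get_ingestion(ingestion_id)
print(fetch_count)  # 2
```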

Similarly, the get_deserialized_record_data utility depends on the record’s topic, so if you already have the topic loaded, you can pass it in as an argument to avoid unnecessary API calls, e.g.,

topic = lqs.fetch.topic(record.topic_id).data
record_data = lqs.utils.get_deserialized_record_data(record, topic=topic)

Furthermore, the get_deserialized_record_data utility uses the Transcode utilities to perform deserialization. When you deserialize a record for a given topic for the first time, the transcoder will cache the deserialization function for that topic, so subsequent deserialization of records from that topic will be faster since it can skip the step of generating the deserialization function. So, if you know you’ll be deserializing many records from the same topic, it’s usually best to instantiate a Transcode object and re-use it to deserialize the records. The get_deserialized_record_data utility accepts a transcoder argument which you can use to pass in a Transcode object, e.g.,

transcoder = Transcode()
record_data = lqs.utils.get_deserialized_record_data(record, topic=topic, transcoder=transcoder)

If you don’t pass this in, it will create a new transcoder object internally and won’t be able to re-use the cached deserialization function across multiple calls, which will result in slower performance when deserializing multiple records from the same topic.

Loading Records vs Loading Record Data

It is important to recognize the difference between loading records (e.g., using lqs.list.record or iter_topic_records) and loading record data (e.g., using fetch_record_bytes or iter_record_data). When you load records, you’re only loading the record index information which is stored in the database. This is a relatively lightweight operation and is optimized for listing and filtering through large numbers of records. However, this record index information doesn’t include the original log data for the record, so if you want to access that, you’ll need to use the record index data (found on the actual record resource) to fetch the original log data from the log file in the object store (e.g., S3), which is a much heavier operation.

More specifically, each record contains a data_offset and data_length field which specify where the original log data for that record is located within the log file in the object store. To fetch the original log data for a record, you need to use these fields to specify a byte range when fetching the log file from the object store, which will then return just the relevant portion of the log file corresponding to that record. This is what the fetch_record_bytes utility method does under the hood. Records themselves don’t directly contain the metadata needed to know where the original log data is located; this is found on the record’s ingestion resource, which is referenced by the record’s ingestion_id field. So, to fetch the original log data for a record, you need to first fetch the record’s ingestion to get the necessary metadata (e.g., the object store ID and object key for the log file), and then use the record’s data_offset and data_length fields to specify a byte range when fetching the log file from the object store.
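Concretely, using the data_offset and data_length from the record we dumped earlier, the byte-range request the utility builds amounts to a standard HTTP Range header:

```python
# values from the record dumped at the top of this page
data_offset, data_length = 18122, 1710802

# HTTP byte ranges are inclusive on both ends
range_header = f"bytes={data_offset}-{data_offset + data_length - 1}"
print(range_header)  # bytes=18122-1728923
```

Sending this header with the GET to the object store returns just that record’s slice of the log file.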

When you go to pull the data from the object store, you don’t perform this operation through the LogQS API. Instead, you request a presigned URL from the LogQS API which you then use to fetch the data directly from the object store (i.e., the S3 bucket). Although a bit convoluted, this approach lets the user transfer data directly from the object store instead of through the LogQS API, which is more efficient and scalable for large data transfers. To put it another way: if your log data is stored in S3, then when you’re fetching record data, you’re fetching directly from S3, not through the LogQS API, so your own system will always be the bottleneck in the transfer. Of course, this is all abstracted for you when you use the utility methods, but it’s important to understand what’s going on under the hood so that you can make informed decisions about how to work with records and record data efficiently. It is worth noting that there is quite a bit of overhead in fetching record data due to the multiple steps involved (fetching the ingestion, generating the presigned URL, fetching the data from the object store, etc.), so if you need to work with a large amount of record data, it’s imperative to reduce this overhead as much as possible by re-using data and utilities across multiple records, or by using digestions to handle batches of records efficiently.

Record and Record Data Iteration

If you’re fetching some data to take a look at a couple of images or something, then fetching record data on a record-by-record basis is probably fine, but if you need to work with a large amount of record data (like for a heavy-duty processing task), then you’ll want to take advantage of the different iteration utilities that we provide which are optimized for working with batches of records and record data.

The iter_topic_records method provides an iterator which yields records (_not_ record data) for a given topic. This allows you to avoid loading all records into memory at once and instead stream them one at a time. In the background, we’re efficiently paginating through the records from the database and yielding them one at a time, making this a performant way to work with large numbers of records in a serialized fashion.

The iter_record_data method provides an iterator which yields tuples of the form (record, record_data) where record is a record and record_data is actual message bytes (or deserialized record data) from the log. This method is optimized for pulling down this data and does things like re-use resources and batches requests to the object store to make this process as efficient as possible. If you need to work with the record data for a large number of records, this is usually the best way to do it.

The iter_record_data method takes an iterable of records as its main input. If you have a list of all the records you need data for, you can pass that in, but a natural way to use this method is to give it an iter_topic_records generator as input, so that you can stream the record data for a topic one record at a time without having to load all the records into memory at once. This is the recommended pattern for working with large amounts of record data. It will look something like this:

topic_id = "0f552dad-30b5-4d93-b6a2-67403527fa3a"
record_iter = lqs.utils.iter_topic_records(topic_id)
record_data_iter = lqs.utils.iter_record_data(record_iter)

for idx, (record, record_data) in enumerate(record_data_iter):
    # do something with the record and record data
    ...

Note that there is some overhead to using these iterators, so if you only need to work with a small number of records, it may be more efficient to just fetch the record data on a record-by-record basis using the utility methods described in the previous sections. However, if you need to work with a large amount of record data, using these iterators will be much more efficient than trying to fetch all the record data at once or on a record-by-record basis.

Additionally, it is important to handle the data returned from these iterators appropriately in order to maximize throughput. For example, if you’re fetching image data for a large number of records, make sure that whatever processing you do on the images is efficient and doesn’t become a bottleneck in the overall process. This is especially important if you’re doing something computationally intensive with the record data, such as running it through a machine learning model. In these cases, you may want to batch the record data to improve efficiency, or use multiprocessing to parallelize the processing across multiple CPU cores. The best approach depends on the specifics of your use case and the resources you have available, but keep in mind that fetching the record data is just one part of the overall process: the rest of your pipeline needs to keep up with the rate at which you’re fetching the record data in order to maximize throughput.

As an example, consider this minimal example where we fetch image record data and want to save the images to disk:

from lqs import LogQS
from lqs.transcode import Transcode

lqs = LogQS()
transcoder = Transcode()

def process_record_data(record, record_data, topic):
    deserialized_record_data = transcoder.deserialize(
        message_bytes=record_data,
        type_name=topic.type_name,
        type_encoding=topic.type_encoding,
        type_data=topic.type_data
    )

    image = lqs.utils.get_record_image(deserialized_record_data)
    image.save(f"images/{record.timestamp}.jpg", format="JPEG")

topic_id = "0f552dad-30b5-4d93-b6a2-67403527fa3a"
topic = lqs.fetch.topic(topic_id).data # process_record_data needs the topic's type info
record_iter = lqs.utils.iter_topic_records(topic_id)
record_data_iter = lqs.utils.iter_record_data(record_iter)
for idx, (record, record_data) in enumerate(record_data_iter):
    # this is BAD: we'll slow down our entire process by performing
    # CPU-bound processing in the middle of I/O-bound processing
    process_record_data(record, record_data, topic)

This is a bad way to do this since the processing of each record is happening sequentially, so if the processing of each record takes a long time, then this will be very slow. A better way to do this would be to use multiprocessing to parallelize the processing of the record data across multiple CPU cores, e.g.,

from lqs import LogQS
from lqs.transcode import Transcode
from concurrent.futures import ProcessPoolExecutor # for parallel processing

lqs = LogQS()
# note: each worker process creates its own Transcode instance in init_worker

_lqs = None
_transcoder = None

def init_worker():
    global _lqs, _transcoder
    _lqs = LogQS()
    _transcoder = Transcode()

def process_record_data(record, record_data, topic):
    deserialized_record_data = _transcoder.deserialize(
        message_bytes=record_data,
        type_name=topic.type_name,
        type_encoding=topic.type_encoding,
        type_data=topic.type_data
    )

    image = _lqs.utils.get_record_image(deserialized_record_data)
    image.save(f"images/{record.timestamp}.jpg", format="JPEG")

# set up a multiprocessing pool with an initializer to avoid re-instantiating
# LogQS and Transcode objects for each record
max_workers = 4 # tune this to your machine's core count
futures = []
pool = ProcessPoolExecutor(max_workers=max_workers, initializer=init_worker)

topic_id = "0f552dad-30b5-4d93-b6a2-67403527fa3a"
topic = lqs.fetch.topic(topic_id).data # fetched once here; passed to each worker task
record_iter = lqs.utils.iter_topic_records(topic_id)
record_data_iter = lqs.utils.iter_record_data(record_iter)
for idx, (record, record_data) in enumerate(record_data_iter):
    # this is GOOD: we're parallelizing the processing of the record data across multiple CPU cores
    # so we can keep up with the rate at which we're fetching the record data
    future = pool.submit(process_record_data, record, record_data, topic)
    futures.append(future)

for future in futures:
    future.result() # wait for all processing to finish

pool.shutdown()

In this example, we’re using a ProcessPoolExecutor to parallelize the processing of the record data across multiple CPU cores. We’re also using an initializer function to instantiate the LogQS and Transcode objects once per worker process, which allows us to re-use these objects across multiple records without having to re-instantiate them for each record (which would be very inefficient). This way, we can efficiently process a large number of records in parallel while fetching the record data using the iterators.

Of course, there’s a virtually infinite number of ways to optimize this kind of workflow depending on the specifics of your use case and the resources you have available. For example, it may be easier to store the data to disk first, then process it from there using a separate script that doesn’t have to worry about fetching the data from the object store and can focus on processing the data as quickly as possible.

In one script, you could fetch the record data and save it to disk:

from lqs import LogQS

lqs = LogQS()

topic_id = "0f552dad-30b5-4d93-b6a2-67403527fa3a"
record_iter = lqs.utils.iter_topic_records(topic_id)
record_data_iter = lqs.utils.iter_record_data(record_iter)
for idx, (record, record_data) in enumerate(record_data_iter):
    with open(f"record_data/{record.timestamp}.bin", "wb") as f:
        f.write(record_data)

In another script, you could watch for new files to be added to the record_data folder and process them as they come in, e.g.,

import os
import time
from lqs import LogQS
from lqs.transcode import Transcode

lqs = LogQS()
transcoder = Transcode()
processed_files = set()

topic_id = "0f552dad-30b5-4d93-b6a2-67403527fa3a"
topic = lqs.fetch.topic(topic_id).data # needed for deserialization below

def process_record_data_file(file_path):
    with open(file_path, "rb") as f:
        record_data = f.read()

    deserialized_record_data = transcoder.deserialize(
        message_bytes=record_data,
        type_name=topic.type_name,
        type_encoding=topic.type_encoding,
        type_data=topic.type_data
    )

    image = lqs.utils.get_record_image(deserialized_record_data)
    file_stem = os.path.splitext(os.path.basename(file_path))[0] # strip the .bin extension
    image.save(f"images/{file_stem}.jpg", format="JPEG")

    # optionally delete the file after processing
    os.remove(file_path)

while True:
    for file_name in os.listdir("record_data"):
        file_path = os.path.join("record_data", file_name)
        if file_path not in processed_files:
            process_record_data_file(file_path)
            processed_files.add(file_path)
    time.sleep(0.1)  # Sleep for a short period to avoid busy-waiting

Regardless of how you set up your workflow, the key thing to keep in mind is that fetching record data is just one part of the overall process, and you need to make sure that the rest of your processing pipeline can keep up with the rate at which you’re fetching the record data in order to maximize throughput. On reasonable compute, you can expect to be able to fetch the record data as fast as you could download the original log data from the object store, so if your processing of the record data is slower than this, then it will become a bottleneck in the overall process, and you’ll want to optimize that part of the process as much as possible (e.g., by parallelizing the processing across multiple CPU cores, or by batching the record data in some way to improve efficiency).

LogQS vs File-Based Approaches

LogQS is designed to handle large amounts of log data efficiently and easily, but if you’re using tools that are designed to work with file-based data, you may find it easier to just download the original log files from the object store and work with them locally using your existing tools. This is a valid approach, but it is worth considering the trade-offs between these two approaches before deciding which one to use:

  • LogQS works well when you only need a small subset of the data found in the original log files. For example, if you have a log file with 100 GB of data, but you only need to work with a few hundred MB of that data, then using LogQS to fetch just the relevant records and record data can be much more efficient than downloading the entire log file and working with it locally. On the other hand, if you need to work with most or all of the data in the original log files, then it may be more efficient to just download the log files and work with them locally using your existing tools.

  • LogQS strips away much of the metadata found in the original log files and only surfaces the relevant metadata on the record resources. This can be a pro or a con depending on your use case. If you only care about the core metadata that’s found on the record resources, then LogQS can provide a much cleaner and more streamlined interface for working with the data. However, if you need access to some of the other metadata that’s found in the original log files (e.g., certain ROS message fields that aren’t included in the record resources), then working with the original log files locally may be a better option.

  • If you need to reprocess the data multiple times, or if you need to work with the data in a way that isn’t well-supported by the LogQS API, then it may be more efficient to just download the original log files and work with them locally using your existing tools. This way, you can avoid the overhead of fetching record data through the LogQS API multiple times and can just work with the data directly on your local machine.

  • If you’re working in a distributed environment where multiple people or processes need to access the same data, then using LogQS can be more efficient than downloading the original log files and working with them locally. This way, you can avoid having multiple copies of the same log files floating around and can just have everyone access the data through the LogQS API.

  • If you’re working with a large amount of data and don’t have the resources to download and work with the original log files locally, then using LogQS can be a more efficient option since it allows you to work with the data without having to download it all to your local machine. It’s also worth noting that LogQS lets you start working with record data very quickly, compared to having to download the entirety of the original log file before you can start working with the data, so if you want to iterate quickly on the data and your processing, LogQS can be a good option. That is: downloading an entire log file first and then processing it will never be faster than iterating on the data through LogQS. The raw download speed is the same in both cases since the data comes from the same object store, but with LogQS you can start working with the data as soon as you fetch the relevant records and record data, whereas if you download the entire log file first, you have to wait for the whole download to finish before you can start.

In general, the best approach will depend on the specifics of your use case and the resources you have available.