The dataset and feature repository capability lets you unify the definition and the storage of the ML data assets. Having a central repository of fresh, high-quality data assets enables shareability, discoverability, and reusability. The repository also provides data consistency for training and inference. This helps data scientists and ML researchers save time on data preparation and feature engineering, which typically take up a significant amount of their time. Key functionalities in the data and feature repository include the following:
- Enable shareability, discoverability, reusability, and versioning of data assets.
- Allow real-time ingestion and low-latency serving for event streaming and online prediction workloads.
- Allow high-throughput batch ingestion and serving for extract, transform, load (ETL) processes and model training, and for scoring workloads.
- Enable feature versioning for point-in-time queries.
- Support various data modalities, including tabular data, images, and text.
ML data assets can be managed at the entity features level or at the full dataset level. For example, a feature repository might contain an entity called customer, which includes features like age group, postal code, and gender. On the other hand, a dataset repository might include a customer churn dataset, which includes features from the customer and product entities, as well as purchase- and web-activity event logs.