Explore Core Data Concepts
Data Formats Overview
Data consists of facts such as numbers, descriptions, or observations. It is used to record information and can be classified into structured, semi-structured, and unstructured formats.
1. Structured Data
- Definition: Adheres to a fixed schema, with uniform fields or properties.
- Representation: Typically tabular, using rows for entity instances and columns for attributes.
- Example:
- Customer Table: Name, Address, Phone Number.
- Product Table: Product ID, Price, Stock Level.
- Storage: Often in relational databases with tables referencing each other via keys.
2. Semi-structured Data
- Definition: Has some structure but allows variation between entity instances.
- Common Formats: JSON, XML, YAML.
- Example (JSON):
{ "firstName": "Joe", "lastName": "Jones", "address": { "streetAddress": "1 Main St.", "city": "New York", "state": "NY", "postalCode": "10099" }, "contact": [ { "type": "home", "number": "555 123-1234" }, { "type": "email", "address": "joe@litware.com" } ] } - Flexibility: Fields can vary between instances (e.g., some customers might lack email addresses).
3. Unstructured Data
- Definition: Data with no specific structure, such as documents, images, audio, and video files.
- Examples:
- Text documents (e.g., Word files).
- Multimedia files (e.g., photos, videos, and audio recordings).
Data Storage Options
Organizations store data in various formats for analysis, reporting, or reference:
- File Stores: Suitable for unstructured data.
- Databases: Ideal for structured and semi-structured data.
Understanding these formats helps organizations choose appropriate storage solutions and optimize data processing.
Simple Notes on File Storage
1. File Storage Overview
- Data can be stored locally, on removable media, or in centralized/shared storage (increasingly hosted in the cloud).
- The choice of file format depends on:
- Data type: structured, semi-structured, or unstructured.
- Applications/services that will use the data.
- Human-readability vs. optimized storage/processing needs.
2. Common File Formats
-
Delimited Text Files
- Stores data in plain text with delimiters like commas (CSV), tabs (TSV), or spaces.
- Good for structured data, human-readable.
- Example (CSV):
FirstName,LastName,Email Joe,Jones,joe@litware.com Samir,Nadoy,samir@northwind.com
-
JSON (JavaScript Object Notation)
- Hierarchical format for structured and semi-structured data.
- Flexible and human-readable.
- Example:
{ "customers": [ { "firstName": "Joe", "lastName": "Jones", "contact": [ { "type": "home", "number": "555 123-1234" }, { "type": "email", "address": "joe@litware.com" } ] } ] }
-
XML (Extensible Markup Language)
- Uses tags to define data.
- Largely replaced by JSON but still used in some systems.
- Example:
<Customers> <Customer name="Joe" lastName="Jones"> <ContactDetails> <Contact type="email" address="joe@litware.com"/> </ContactDetails> </Customer> </Customers>
3. Binary Large Objects (BLOBs)
- Stores unstructured data like images, videos, and documents as raw binary.
- Requires applications to interpret the binary data.
4. Optimized File Formats
-
Designed for efficient storage, compression, and processing:
-
Avro
- Row-based format with a JSON header and binary data.
- Compresses data efficiently, good for minimizing storage and network usage.
-
ORC (Optimized Row Columnar)
- Column-based format developed for Apache Hive.
- Stores data in "stripes" with indexes and statistical information for each column.
-
Parquet
- Columnar format with "row groups" for nested data types.
- Highly efficient compression and encoding for large datasets.
-
Key Takeaway:
Choose file formats based on the type of data, application requirements, and the need for storage efficiency or human readability.
Simple Notes on Databases
1. What is a Database?
- A system to store and query data.
- Unlike a file system, a database is specialized for managing structured or unstructured data records.
2. Relational Databases (SQL)
- Key Features:
- Data is stored in tables (entities like customers, products, orders).
- Each record has a primary key to uniquely identify it.
- Tables are linked using primary keys and foreign keys.
- Follows normalization to avoid duplicate data.
- Uses Structured Query Language (SQL) for management and queries.
- Example:
- A customer's details are stored in one table, and their sales orders reference the customer using their primary key.
3. Non-Relational Databases (NoSQL)
- Doesn’t follow the traditional relational schema.
- Four Common Types:
- Key-Value Databases:
- Data is stored as unique keys and associated values.
- Example:
- Key:
user123 - Value:
{name: "John", age: 30}
- Key:
- Document Databases:
- A form of key-value database where the value is a JSON document.
- Optimized for parsing and querying JSON data.
- Column-Family Databases:
- Stores data in rows and columns but groups columns into families.
- Columns in the same family are logically related.
- Example: Columns like
NameandAgemight be grouped in a "PersonalInfo" family.
- Graph Databases:
- Data entities are represented as nodes.
- Relationships between entities are stored as links.
- Useful for representing networks, such as social connections or transport systems.
- Key-Value Databases:
Key Takeaway:
- Relational databases are ideal for structured data and use SQL.
- Non-relational databases are flexible, handle various data formats, and include key-value, document, column-family, and graph databases.
Notes on Transactional Data Processing
1. What is Transactional Data Processing?
- A system that records transactions representing specific events (e.g., financial transfers, retail payments).
- Designed for high-volume, fast processing of transactions.
- Often referred to as Online Transactional Processing (OLTP).
2. Key Features of OLTP Systems:
- Optimized for CRUD operations:
- Create, Retrieve, Update, Delete.
- Ensures data integrity through transactional properties.
3. ACID Properties of Transactions:
-
Atomicity:
- Each transaction is all-or-nothing.
- Example: Transferring money between accounts succeeds or fails entirely.
-
Consistency:
- Database moves from one valid state to another.
- Example: Post-transaction balances reflect accurate funds transfer.
-
Isolation:
- Concurrent transactions don’t interfere with each other.
- Example: A balance check during a transfer shows consistent results.
-
Durability:
- Committed transactions persist even after system shutdown.
- Example: Updated balances are saved even if the database restarts.
4. Use Cases:
- Line of Business (LOB) Applications:
- Support live business operations like banking, retail, and inventory management.
Key Takeaway:
OLTP systems are critical for managing real-time, high-volume transactions with a focus on speed, reliability, and data consistency.
Analytical Data Processing Overview
Analytical data processing involves systems designed primarily for read-only or read-mostly operations, enabling analysis of vast volumes of historical data and business metrics. These systems focus on deriving insights rather than updating data.
Common Architecture for Analytical Processing
-
Data Ingestion and Transformation:
- ETL Process: Extract, Transform, Load.
- Data Lake: Stores raw, file-based data for analysis.
-
Structured Storage:
- Data Lakehouse: Combines scalable storage of a data lake with SQL querying capabilities of a data warehouse.
- Data Warehouse: Relational schema optimized for read-heavy workloads (e.g., reports, visualizations).
-
Aggregation and Modeling:
- OLAP Model (Cubes): Pre-aggregated data for fast analytical queries.
- Dimensions and Measures: For example, sales revenue aggregated by date, customer, and product.
-
Analytics and Reporting:
- Data is queried for reports, dashboards, and visualizations to support decision-making.
Key Components
-
Data Lakes:
- Ideal for large-scale, raw data storage.
- Used by data scientists for exploratory analysis.
-
Data Warehouses:
- Relational storage for structured data.
- Enables efficient querying and reporting for analysts.
-
Data Lakehouses:
- Combines scalability and flexibility of lakes with querying performance of warehouses.
-
OLAP Models:
- Pre-calculated aggregates for fast querying.
- Enables hierarchical drill-down (e.g., sales by region → city → address).
User Roles in Analytical Work
-
Data Scientists:
- Work directly with raw data in a data lake.
- Build models and perform exploratory analysis.
-
Data Analysts:
- Query data in a data warehouse.
- Create detailed reports and visualizations.
-
Business Users:
- Use pre-aggregated data in dashboards or reports.
- Focus on decision-making based on summarized insights.
Benefits of Analytical Processing
- Scalable Insights: Handle large data volumes efficiently.
- Optimized Performance: Pre-aggregated models enable fast query responses.
- Hierarchical Analysis: Drill-down into specific details or zoom out for broader trends.
This architecture supports diverse analytical needs across roles, ensuring efficient data utilization and insightful decision-making.
Comments
Post a Comment