DP 900 Full Course

Explore Core Data Concepts

Data Formats Overview

Data consists of facts such as numbers, descriptions, or observations. It is used to record information and can be classified into structured, semi-structured, and unstructured formats.

1. Structured Data

Definition: Adheres to a fixed schema, with uniform fields or properties.
Representation: Typically tabular, using rows for entity instances and columns for attributes.
Example:
- Customer Table: Name, Address, Phone Number.
- Product Table: Product ID, Price, Stock Level.
Storage: Often in relational databases with tables referencing each other via keys.

2. Semi-structured Data

Definition: Has some structure but allows variation between entity instances.
Common Formats: JSON, XML, YAML.

Example (JSON):

{
  "firstName": "Joe",
  "lastName": "Jones",
  "address": {
    "streetAddress": "1 Main St.",
    "city": "New York",
    "state": "NY",
    "postalCode": "10099"
  },
  "contact": [
    { "type": "home", "number": "555 123-1234" },
    { "type": "email", "address": "joe@litware.com" }
  ]
}

Flexibility: Fields can vary between instances (e.g., some customers might lack email addresses).

3. Unstructured Data

Definition: Data with no specific structure, such as documents, images, audio, and video files.
Examples:
- Text documents (e.g., Word files).
- Multimedia files (e.g., photos, videos, and audio recordings).

Data Storage Options

Organizations store data in various formats for analysis, reporting, or reference:

File Stores: Suitable for unstructured data.
Databases: Ideal for structured and semi-structured data.

Understanding these formats helps organizations choose appropriate storage solutions and optimize data processing.

Simple Notes on File Storage

1. File Storage Overview

Data can be stored locally, on removable media, or in centralized/shared storage (increasingly hosted in the cloud).
The choice of file format depends on:
- Data type: structured, semi-structured, or unstructured.
- Applications/services that will use the data.
- Human-readability vs. optimized storage/processing needs.

2. Common File Formats

Delimited Text Files
- Stores data in plain text with delimiters like commas (CSV), tabs (TSV), or spaces.
- Good for structured data, human-readable.
- Example (CSV):
```
FirstName,LastName,Email  
Joe,Jones,joe@litware.com  
Samir,Nadoy,samir@northwind.com  
```

JSON (JavaScript Object Notation)

Hierarchical format for structured and semi-structured data.
Flexible and human-readable.

Example:

{  
  "customers": [  
    {  
      "firstName": "Joe",  
      "lastName": "Jones",  
      "contact": [  
        { "type": "home", "number": "555 123-1234" },  
        { "type": "email", "address": "joe@litware.com" }  
      ]  
    }  
  ]  
}

XML (Extensible Markup Language)

Uses tags to define data.
Largely replaced by JSON but still used in some systems.

Example:

<Customers>  
  <Customer name="Joe" lastName="Jones">  
    <ContactDetails>  
      <Contact type="email" address="joe@litware.com"/>  
    </ContactDetails>  
  </Customer>  
</Customers>

3. Binary Large Objects (BLOBs)

Stores unstructured data like images, videos, and documents as raw binary.
Requires applications to interpret the binary data.

4. Optimized File Formats

Designed for efficient storage, compression, and processing:
- Avro
  - Row-based format with a JSON header and binary data.
  - Compresses data efficiently, good for minimizing storage and network usage.
- ORC (Optimized Row Columnar)
  - Column-based format developed for Apache Hive.
  - Stores data in "stripes" with indexes and statistical information for each column.
- Parquet
  - Columnar format with "row groups" for nested data types.
  - Highly efficient compression and encoding for large datasets.

Key Takeaway:
Choose file formats based on the type of data, application requirements, and the need for storage efficiency or human readability.

Simple Notes on Databases

1. What is a Database?

A system to store and query data.
Unlike a file system, a database is specialized for managing structured or unstructured data records.

2. Relational Databases (SQL)

Key Features:
- Data is stored in tables (entities like customers, products, orders).
- Each record has a primary key to uniquely identify it.
- Tables are linked using primary keys and foreign keys.
- Follows normalization to avoid duplicate data.
- Uses Structured Query Language (SQL) for management and queries.
Example:
- A customer's details are stored in one table, and their sales orders reference the customer using their primary key.

3. Non-Relational Databases (NoSQL)

Doesn’t follow the traditional relational schema.
Four Common Types:
1. Key-Value Databases:
  - Data is stored as unique keys and associated values.
  - Example:
    - Key: user123
    - Value: {name: "John", age: 30}
2. Document Databases:
  - A form of key-value database where the value is a JSON document.
  - Optimized for parsing and querying JSON data.
3. Column-Family Databases:
  - Stores data in rows and columns but groups columns into families.
  - Columns in the same family are logically related.
  - Example: Columns like Name and Age might be grouped in a "PersonalInfo" family.
4. Graph Databases:
  - Data entities are represented as nodes.
  - Relationships between entities are stored as links.
  - Useful for representing networks, such as social connections or transport systems.

Key Takeaway:

Relational databases are ideal for structured data and use SQL.
Non-relational databases are flexible, handle various data formats, and include key-value, document, column-family, and graph databases.

Notes on Transactional Data Processing

1. What is Transactional Data Processing?

A system that records transactions representing specific events (e.g., financial transfers, retail payments).
Designed for high-volume, fast processing of transactions.
Often referred to as Online Transactional Processing (OLTP).

2. Key Features of OLTP Systems:

Optimized for CRUD operations:
- Create, Retrieve, Update, Delete.
Ensures data integrity through transactional properties.

3. ACID Properties of Transactions:

Atomicity:
- Each transaction is all-or-nothing.
- Example: Transferring money between accounts succeeds or fails entirely.
Consistency:
- Database moves from one valid state to another.
- Example: Post-transaction balances reflect accurate funds transfer.
Isolation:
- Concurrent transactions don’t interfere with each other.
- Example: A balance check during a transfer shows consistent results.
Durability:
- Committed transactions persist even after system shutdown.
- Example: Updated balances are saved even if the database restarts.

4. Use Cases:

Line of Business (LOB) Applications:
- Support live business operations like banking, retail, and inventory management.

Key Takeaway:
OLTP systems are critical for managing real-time, high-volume transactions with a focus on speed, reliability, and data consistency.

Analytical Data Processing Overview

Analytical data processing involves systems designed primarily for read-only or read-mostly operations, enabling analysis of vast volumes of historical data and business metrics. These systems focus on deriving insights rather than updating data.

Common Architecture for Analytical Processing

Data Ingestion and Transformation:
- ETL Process: Extract, Transform, Load.
- Data Lake: Stores raw, file-based data for analysis.
Structured Storage:
- Data Lakehouse: Combines scalable storage of a data lake with SQL querying capabilities of a data warehouse.
- Data Warehouse: Relational schema optimized for read-heavy workloads (e.g., reports, visualizations).
Aggregation and Modeling:
- OLAP Model (Cubes): Pre-aggregated data for fast analytical queries.
- Dimensions and Measures: For example, sales revenue aggregated by date, customer, and product.
Analytics and Reporting:
- Data is queried for reports, dashboards, and visualizations to support decision-making.

Key Components

Data Lakes:
- Ideal for large-scale, raw data storage.
- Used by data scientists for exploratory analysis.
Data Warehouses:
- Relational storage for structured data.
- Enables efficient querying and reporting for analysts.
Data Lakehouses:
- Combines scalability and flexibility of lakes with querying performance of warehouses.
OLAP Models:
- Pre-calculated aggregates for fast querying.
- Enables hierarchical drill-down (e.g., sales by region → city → address).

User Roles in Analytical Work

Data Scientists:
- Work directly with raw data in a data lake.
- Build models and perform exploratory analysis.
Data Analysts:
- Query data in a data warehouse.
- Create detailed reports and visualizations.
Business Users:
- Use pre-aggregated data in dashboards or reports.
- Focus on decision-making based on summarized insights.

Benefits of Analytical Processing

Scalable Insights: Handle large data volumes efficiently.
Optimized Performance: Pre-aggregated models enable fast query responses.
Hierarchical Analysis: Drill-down into specific details or zoom out for broader trends.

This architecture supports diverse analytical needs across roles, ensuring efficient data utilization and insightful decision-making.

Shivaram Babar

Search This Blog