Debug School

rakesh kumar

Different Types of Data Handled by a Data Lake

Different Types of Data Handled by a Data Lake
Structured Data
Semi-Structured Data
Unstructured Data
Streaming Data
Log Data
Time-Series Data
Machine Learning Data

How a data lake stores huge data
Practical way data is stored in a data lake

Objective and MCQ questions

1️⃣ Structured Data

This is highly organized data in rows and columns.

Examples:

SQL tables (MySQL, PostgreSQL)

Excel sheets

CSV files

ERP transaction records

Booking tables

Payment records

Characteristics:

Fixed schema

Columns and data types defined

Easy to query using SQL
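The fixed-schema idea is easy to see with Python's built-in sqlite3. This is a toy sketch; the table and column names are invented for illustration:

```python
import sqlite3

# Fixed schema: columns and their data types are declared up front.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bookings (
        booking_id INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL,
        amount     REAL    NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?, ?)",
    [(1, 101, 250.0), (2, 102, 99.5), (3, 101, 410.0)],
)

# Easy to query with SQL because the structure is known in advance.
total = conn.execute(
    "SELECT SUM(amount) FROM bookings WHERE user_id = 101"
).fetchone()[0]
print(total)  # 660.0
conn.close()
```

Any row that does not match the declared columns and types is rejected, which is exactly the rigidity that semi-structured data escapes.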

2️⃣ Semi-Structured Data

Data that doesn’t follow a strict table structure but still has some organization.

Examples:

JSON files

XML

API responses

NoSQL data (MongoDB)

Event logs (clickstream)

Characteristics:

Flexible schema

Nested fields

Schema can change over time

Very common in:

Web applications

Mobile apps

Microservices
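A flexible, evolving schema is easy to demonstrate with the stdlib json module. The clickstream events below are invented; note the second event adds a nested field the first one lacks:

```python
import json

# Two clickstream events; the second adds a nested "device" field the first
# lacks, showing how a semi-structured schema can change over time.
raw = [
    '{"event": "click", "user": "u1"}',
    '{"event": "click", "user": "u2", "device": {"os": "android"}}',
]
events = [json.loads(line) for line in raw]

# Nested fields are plain dict lookups; missing ones fall back to a default.
os_names = [e.get("device", {}).get("os", "unknown") for e in events]
print(os_names)  # ['unknown', 'android']
```

A relational table would need a migration to add that column; here the new field simply appears.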

3️⃣ Unstructured Data

Data with no predefined format.

Examples:

Images (JPG, PNG)

Videos (MP4)

Audio files

PDFs

Medical reports

Emails

Chat transcripts

Social media posts

Characteristics:

Cannot be stored in traditional tables easily

Used in AI / ML / NLP systems

This is where Data Lakes shine compared to databases.

4️⃣ Streaming Data

Real-time continuously generated data.

Examples:

IoT sensor data

App click events

Payment gateway events

GPS tracking

Server logs

Characteristics:

High velocity

Time-based

Needs real-time processing

Stored in:

Kafka → Data Lake

Azure Event Hubs → Data Lake
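The broker-to-lake flow above can be approximated with a toy micro-batch loop in plain Python. There is no real Kafka here; the sensor events and folder layout are invented, but the append-only, date-partitioned pattern is the same:

```python
import json
import os
import tempfile
from datetime import datetime, timezone

lake = tempfile.mkdtemp()

def write_batch(events, base=lake):
    """Append one micro-batch of events as a new file under a date partition."""
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    part_dir = os.path.join(base, "events", f"date={day}")
    os.makedirs(part_dir, exist_ok=True)
    # Each batch becomes a fresh file; existing files are never rewritten.
    path = os.path.join(part_dir, f"part-{len(os.listdir(part_dir)):04d}.json")
    with open(path, "w") as f:
        for e in events:
            f.write(json.dumps(e) + "\n")
    return path

# Two micro-batches arriving over time -> two separate append-only files.
p1 = write_batch([{"sensor": "s1", "temp": 21.5}])
p2 = write_batch([{"sensor": "s1", "temp": 21.7}, {"sensor": "s2", "temp": 19.0}])
print(p1 != p2)  # True
```

In production this loop is what Spark Structured Streaming or a Kafka sink connector does for you, at much larger batch sizes.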

5️⃣ Log Data

System-generated operational data.

Examples:

Web server logs

Error logs

Application logs

Security logs

Very useful for:

Monitoring

Fraud detection

Performance analysis
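A monitoring-style use of log data can be sketched with a few lines of stdlib Python. The access-log lines below are invented, in a simplified Apache-like format:

```python
import re

# A few access-log lines (contents invented for illustration).
logs = [
    '10.0.0.1 - - [21/Feb/2026:10:00:01] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [21/Feb/2026:10:00:02] "GET /pay HTTP/1.1" 500 128',
    '10.0.0.1 - - [21/Feb/2026:10:00:03] "POST /login HTTP/1.1" 401 64',
]

# Extract the HTTP status code that follows the quoted request.
status_re = re.compile(r'" (\d{3}) ')
statuses = [int(status_re.search(line).group(1)) for line in logs]

# Monitoring-style check: count error responses (status >= 400).
errors = sum(1 for s in statuses if s >= 400)
print(errors)  # 2
```

At lake scale the same extraction runs in parallel over years of log files instead of a three-line list.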

6️⃣ Time-Series Data

Data indexed by timestamp.

Examples:

Stock prices

Temperature records

Health monitoring signals

User activity over time

Often used in:

Forecasting

Trend analysis
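A simple trailing moving average is the classic first step in trend analysis over time-series data. The temperature readings below are invented:

```python
# Daily temperature readings, ordered by time (values invented).
temps = [20.0, 21.0, 19.5, 22.0, 23.5, 24.0, 25.0]

def moving_average(series, window):
    """Smooth a time series with a simple trailing moving average."""
    return [
        round(sum(series[i - window + 1 : i + 1]) / window, 2)
        for i in range(window - 1, len(series))
    ]

# Each output point averages the current reading with the two before it,
# damping day-to-day noise so the upward trend stands out.
trend = moving_average(temps, window=3)
print(trend)
```

Libraries like pandas provide the same idea as a one-liner (`rolling(window).mean()`), but the mechanics are just this.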

7️⃣ Machine Learning Data

Used specifically for AI.

Examples:

Training datasets

Feature tables

Model output

Predictions
Historical behavioral data

A Data Lake is ideal for storing large ML datasets.

How a Data Lake Stores Huge Data

1️⃣ A Data Lake Does NOT Store Data Like a Traditional Database

A Data Lake usually sits on top of:

AWS S3

Azure Data Lake Storage (ADLS)

Google Cloud Storage

These are distributed object storage systems, not single machines.

That means:

👉 Data is stored across thousands of physical servers.

2️⃣ Horizontal Scaling (Unlimited Expansion)

Traditional database:

Limited by server disk size

Needs vertical scaling (bigger machine)

Data Lake:

Adds more storage nodes automatically

Scales horizontally

No practical upper limit (petabytes or exabytes)

If you upload 10 TB more data:

It simply distributes it across more storage blocks.

3️⃣ Data is Stored as Files (Not Tables)

Instead of storing rows in a database engine, a Data Lake stores:

Parquet files

ORC files

Delta files

JSON / CSV

Images / Videos

Example:

/datalake/bookings/2026/02/21/part-0001.parquet

Data is split into many small files, not one huge file.

This enables:

Faster parallel reading

Better distribution

Efficient storage

4️⃣ Data Compression

Modern formats like Parquet / ORC / Delta:

Compress data automatically

Store column-wise (columnar storage)

Reduce storage cost significantly

Example:
1 TB raw data → may become 200–300 GB after compression.

5️⃣ Separation of Storage and Compute

In traditional systems:

Storage + compute are tied together.

In Data Lake architecture:

Storage is separate.

Compute (Databricks, Spark) reads data when needed.

So:

Storage can grow independently.

Compute can scale only when required.

This makes it cost-efficient.

6️⃣ Data Partitioning

Data is partitioned by:

Date

Region

User ID

Category

Example:

/bookings/year=2026/month=02/day=21/

So the system scans only the relevant partition, not the entire dataset.

That’s how performance stays high even with huge data.
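Partition pruning boils down to matching `key=value` segments in file paths before reading anything. A minimal sketch in plain Python, with invented paths mirroring the layout above:

```python
# A listing of partitioned files (paths invented for illustration).
files = [
    "/bookings/year=2026/month=01/part-0001.parquet",
    "/bookings/year=2026/month=02/part-0001.parquet",
    "/bookings/year=2026/month=02/part-0002.parquet",
    "/bookings/year=2025/month=12/part-0001.parquet",
]

def prune(paths, **filters):
    """Keep only paths whose key=value partition segments match the filters."""
    wanted = {f"{k}={v}" for k, v in filters.items()}
    return [p for p in paths if wanted <= set(p.strip("/").split("/"))]

# A query for year=2026, month=02 touches 2 of the 4 files;
# the other partitions are never opened.
selected = prune(files, year="2026", month="02")
print(len(selected))  # 2
```

Query engines do the same thing against the object store's file listing, so the cost of a filtered query scales with the partition size, not the dataset size.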

7️⃣ Tiered Storage (Cost Optimization)

Cloud storage automatically moves old data to cheaper storage tiers:

Hot (frequent access)

Warm

Cold / Archive

So even petabytes remain affordable.
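On S3, tiering like this is expressed as lifecycle rules. Below is a sketch of what such rules look like; the rule ID, prefix, and day thresholds are invented, and in a real account you would apply this with boto3's `put_bucket_lifecycle_configuration`:

```python
# Sketch of S3 lifecycle rules that move aging data to cheaper tiers.
# Prefix and day thresholds are invented; apply with
# boto3.client("s3").put_bucket_lifecycle_configuration(...) in a real account.
lifecycle_rules = {
    "Rules": [
        {
            "ID": "tier-old-bookings",
            "Status": "Enabled",
            "Filter": {"Prefix": "bookings/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm
                {"Days": 365, "StorageClass": "GLACIER"},     # cold / archive
            ],
        }
    ]
}
print(len(lifecycle_rules["Rules"]))
```

ADLS and Google Cloud Storage offer equivalent lifecycle-management policies under different names.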

🔥 Why It Can Handle Huge Data

Because it is:

Distributed

Parallel

Compressed

Partitioned

Cloud-native

Decoupled from compute

Practical Way Data Is Stored in a Data Lake

Imagine AWS S3 or Azure Data Lake like this:

datalake/
   bookings/
      year=2026/
         month=01/
            part-0001.parquet
            part-0002.parquet
         month=02/
            part-0001.parquet
   users/
      part-0001.parquet
   logs/
      2026-02-20.json
      2026-02-21.json

👉 It stores files, not database rows.
👉 Each file may contain millions of records.
👉 Files are distributed across thousands of servers.

🔹 Step 2: How It Stores Huge Data (Distributed Storage)

When you upload a file to S3:

It is automatically split internally into blocks.

Those blocks are stored across multiple data centers.

If one server fails → data is still safe.

You don’t manage servers — cloud handles it.

Example:


Upload 5 TB of data →
S3 automatically spreads it across many storage nodes.

That’s why it can scale to petabytes.

🔹 Step 3: Practical Example – Writing Large Data Using Spark

Let’s simulate large data in Databricks or Spark.

Create 10 million records

from pyspark.sql import functions as F

df = spark.range(0, 10_000_000) \
    .withColumn("user_id", F.col("id")) \
    .withColumn("amount", F.rand() * 1000) \
    .withColumn("year", F.lit(2026)) \
    .withColumn("month", (F.col("id") % 12) + 1)

df.write \
  .partitionBy("year", "month") \
  .format("parquet") \
  .mode("overwrite") \
  .save("/mnt/datalake/bookings")

Now check storage.

You’ll see something like:

/bookings/year=2026/month=1/part-00000.parquet
/bookings/year=2026/month=2/part-00001.parquet
...

Instead of one huge file, Spark created many small distributed files.

That’s scalability.

🔹 Step 4: See Compression in Action

If you store as CSV:

df.write.mode("overwrite").csv("/mnt/datalake/csv_test")

Now compare with Parquet:

df.write.mode("overwrite").parquet("/mnt/datalake/parquet_test")

You’ll notice:

Parquet size << CSV size

Because:

Parquet stores column-wise

Uses compression (Snappy)

Avoids repetition

Huge savings at TB scale.

🔹 Step 5: How Partitioning Prevents Full Scans

If you query:

spark.read.parquet("/mnt/datalake/bookings") \
     .filter("year=2026 AND month=2") \
     .count()

Spark only reads:

/year=2026/month=2/

Not entire dataset.

This is how it handles huge data efficiently.

🔹 Step 6: Storage & Compute Separation (Very Important)

Storage:

Cheap (S3 / ADLS)

Scales infinitely

Compute:

Spark cluster

Only runs when needed

Scales up/down automatically

So even if you store 5 PB of data,
you pay for compute only when you analyze it.

🔹 Step 7: Real-World Size Example

Let’s say:

1 hospital system generates 5 GB/day logs

1 year = ~1.8 TB

5 years = ~9 TB
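The figures above are simple back-of-envelope arithmetic, easy to verify:

```python
# Back-of-envelope check of the sizes above (using 1 TB = 1000 GB).
gb_per_day = 5
one_year_tb = gb_per_day * 365 / 1000   # ~1.8 TB per year
five_years_tb = one_year_tb * 5         # ~9.1 TB over five years
print(round(one_year_tb, 1), round(five_years_tb, 1))
```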

Data Lake handles this easily because:

It just keeps adding files

No schema restriction

No performance drop

A traditional database would struggle at this scale.

20 MCQs on Data Lake Data Types
1️⃣

Which type of data is stored in its original form in a Data Lake?
A) Transformed data only
B) Raw data
C) Only structured data
D) Processed data

✅ Answer: B

2️⃣

Structured data refers to:
A) Images and videos
B) Data stored in tables with schema
C) Unlabeled text files
D) Audio files

✅ Answer: B

3️⃣

Which storage format below is commonly used in Data Lakes for analytics?
A) PDF
B) Parquet
C) TXT
D) PPT

✅ Answer: B

4️⃣

Semi-structured data includes:
A) Excel tables
B) JSON and XML
C) Audio files
D) Plain text without organization

✅ Answer: B

5️⃣

Which type of data is NOT structured but contains some organization?
A) Binary logs
B) JSON files
C) Relational tables
D) Excel sheets

✅ Answer: B

6️⃣

Which type of data includes videos, images, and audio?
A) Structured
B) Metadata
C) Unstructured
D) Semi-structured

✅ Answer: C

7️⃣

Log files and event streams are examples of:
A) Structured data
B) Unstructured data
C) Time-series data
D) Meta descriptions

✅ Answer: C

8️⃣

In a Data Lake, data is typically stored in:

A) Rows only
B) Columns only
C) Files and objects
D) Temporary buffers

✅ Answer: C

9️⃣

Which data type is often used for machine learning training?

A) Structured only
B) Unstructured only
C) Any type (structured, unstructured, semi-structured)
D) No data type

✅ Answer: C

🔟

Object Storage in Data Lakes is usually:

A) Limited to text data
B) Used to store blobs of data (images/videos)
C) Only for SQL tables
D) Temporary

✅ Answer: B

1️⃣1️⃣

Which format is most efficient for big data analytics in a Data Lake?

A) CSV
B) Parquet
C) JPG
D) GIF

✅ Answer: B

1️⃣2️⃣

Time-series data is collected continuously from:

A) Books
B) Sensors and logs
C) Photos
D) Emails

✅ Answer: B

1️⃣3️⃣

Which data type would a clinical image like an X-ray fall under?

A) Structured
B) Semi-structured
C) Unstructured
D) Meta data

✅ Answer: C

1️⃣4️⃣

Web clickstream data is usually stored as:

A) Excel sheets
B) JSON logs
C) CSV only
D) PDF

✅ Answer: B

1️⃣5️⃣

Which data type can be nested and hierarchical?

A) XML
B) MP3
C) MOV
D) PNG

✅ Answer: A

1️⃣6️⃣

Which type of data is least likely to fit in a traditional relational database?

A) Numeric tables
B) JSON logs
C) ID fields
D) SQL records

✅ Answer: B

1️⃣7️⃣

Data Lakes handle both batch and ____ data.

A) Stream
B) Temporary
C) Processed
D) Calculation

✅ Answer: A

1️⃣8️⃣

Which of the following is NOT a reason Data Lakes store diverse data types?

A) Scalability
B) Cost efficiency
C) Strict schema requirement
D) Flexibility

✅ Answer: C

1️⃣9️⃣

Big data analytics often requires:

A) Only structured data
B) Multiple data types
C) No data storage
D) Only images

✅ Answer: B

2️⃣0️⃣

Which type of data stored in Data Lake is most commonly used for AI and ML training?

A) Audio & text only
B) Only structured tables
C) All types — structured, semi-structured, unstructured
D) Only numeric columns

✅ Answer: C
