<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[UnleashDataBytes]]></title><description><![CDATA[A newsletter to unleash practical insights & simplify complex concepts around the Data realm]]></description><link>https://newsletter.kirankbs.com</link><image><url>https://substackcdn.com/image/fetch/$s_!emMU!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c275ca9-e012-4196-ae89-0bd4ebe362aa_800x800.png</url><title>UnleashDataBytes</title><link>https://newsletter.kirankbs.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 07 Apr 2026 10:47:21 GMT</lastBuildDate><atom:link href="https://newsletter.kirankbs.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[kiran kumar]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[unleashdatabytes@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[unleashdatabytes@substack.com]]></itunes:email><itunes:name><![CDATA[kiran kumar]]></itunes:name></itunes:owner><itunes:author><![CDATA[kiran kumar]]></itunes:author><googleplay:owner><![CDATA[unleashdatabytes@substack.com]]></googleplay:owner><googleplay:email><![CDATA[unleashdatabytes@substack.com]]></googleplay:email><googleplay:author><![CDATA[kiran kumar]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Four Hallucinations and a Python Script]]></title><description><![CDATA[Copilot Hallucination over simple Databricks Job/Task Parameters]]></description><link>https://newsletter.kirankbs.com/p/four-hallucinations-and-a-python</link><guid 
isPermaLink="false">https://newsletter.kirankbs.com/p/four-hallucinations-and-a-python</guid><dc:creator><![CDATA[kiran kumar]]></dc:creator><pubDate>Fri, 27 Mar 2026 19:40:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!emMU!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c275ca9-e012-4196-ae89-0bd4ebe362aa_800x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I asked an LLM agent to get a Databricks job ID at runtime. It confidently proposed four approaches. All four were wrong. The fix was a few lines of Python I could have written in ten minutes.</p><div><hr></div><p>I had a custom metrics table. It was working. Batch durations, row counts, streaming heartbeats, all landing in Delta. One problem: the `job_id` and `run_id` columns were null in every row.</p><p>These two columns exist so you can join custom metrics to Databricks system tables. Without them, my per-batch timing data lives in isolation. With them, one SQL join gives you batch internals correlated with job cost, cluster utilization, and run outcomes. 
The whole point of the table.</p><p>So I asked my LLM coding agent to fix it. What followed was an afternoon of increasingly creative hallucinations, each delivered with full confidence, each completely wrong.</p><h1>Hallucination 1: Spark conf</h1><p>The agent&#8217;s first suggestion:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">job_id = spark.conf.get("spark.databricks.job.id")

run_id = spark.conf.get("spark.databricks.job.runId")</code></pre></div><p>Sensible-looking. There are plenty of Stack Overflow answers and blog posts mentioning these keys. The agent had probably been trained on hundreds of them.</p><p>The result on our serverless compute:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">ERROR: [CONFIG_NOT_AVAILABLE] Configuration spark.databricks.job.id is not available.</code></pre></div><p>Not &#8220;key not found.&#8221; Not &#8220;returns null.&#8221; A hard error with a JVM stack trace 80 lines long. This config key doesn&#8217;t exist in the Spark Connect protocol that serverless uses. The agent had no way to know that because it was trained on content from the classic compute era.</p><h1>Hallucination 2: environment variables</h1><p>After the Spark conf failure, the agent pivoted to environment variables:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import os

job_id = int(os.environ["DATABRICKS_JOB_ID"])

run_id = int(os.environ["DATABRICKS_RUN_ID"])</code></pre></div><p>This one was interesting because the agent didn&#8217;t just suggest reading env vars. It invented the variable names. `DATABRICKS_JOB_ID` is not a real environment variable that Databricks sets. The agent generated a plausible-sounding name, wrote the code with confidence, and I deployed it.</p><p>The metrics kept showing null.</p><p>I dumped every environment variable matching &#8220;JOB&#8221;, &#8220;RUN&#8221;, or &#8220;DATABRICKS&#8221; from a running job. Here&#8217;s what Databricks actually sets:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">DATABRICKS_RUNTIME_VERSION=client.5.1

DATABRICKS_CLUSTER_LIBS_PYTHON_ROOT_DIR=python

DATABRICKS_GANGLIA_ENABLED=FALSE</code></pre></div><p>Runtime metadata. Library paths. Nothing about job or run identity. `DATABRICKS_JOB_ID` doesn&#8217;t exist. The agent made it up.</p><h1>Hallucination 3: dbutils notebook context</h1><p>Third attempt. The agent went deeper into the Databricks internals:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()

job_id = int(ctx.tags().get("jobId").get())

run_id = int(ctx.tags().get("idInJob").get())</code></pre></div><p>This is a real API. It actually works, in notebooks. But we run Python wheel tasks, not notebooks. The error:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">module 'pyspark.dbutils' has no attribute 'notebook'</code></pre></div><p>The `pyspark.dbutils` module exists in a wheel task context, but the `notebook` sub-module doesn&#8217;t load. There&#8217;s no notebook, so there&#8217;s nothing to load it for. But the agent found an API that looks right, generated the code, and moved on.</p><h1>Hallucination 4: dynamic references in spark_env_vars</h1><p>At this point I pulled up the Databricks docs myself. I found the page on dynamic value references. The agent read it too and proposed putting `{{job_id}}` in the `spark_env_vars` of our Databricks Asset Bundle (DAB):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml">spark_env_vars:

  DATABRICKS_JOB_ID: "{{job_id}}"

  DATABRICKS_RUN_ID: "{{run_id}}"</code></pre></div><p>Two problems. First, the syntax was wrong. The correct dynamic references are `{{job.id}}` and `{{job.run_id}}`, not `{{job_id}}` and `{{run_id}}`. Second, and more fundamentally, `spark_env_vars` doesn&#8217;t resolve dynamic value references at all. The values pass through as literal strings. The cluster environment showed:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">DATABRICKS_RUN_ID={{run_id}}</code></pre></div><p>Not the run ID. The literal text `{{run_id}}`.</p><p>The Databricks docs don&#8217;t say &#8220;dynamic value references don&#8217;t work in spark_env_vars.&#8221; They just don&#8217;t list spark_env_vars as a supported location. The docs describe where they do work (task parameters, job parameters), but they never explicitly say where they don&#8217;t. That silence is a trap for both humans and language models.</p><h1>The documentation problem</h1><p>The Databricks documentation for dynamic value references says you can use `{{job.id}}` in &#8220;parameters or fields that pass context into tasks.&#8221; It gives examples for notebook `base_parameters` and job-level `parameters`. For Python wheel tasks, it says &#8220;parameters defined in the task definition are passed as keyword arguments to your code.&#8221;</p><p>What it doesn&#8217;t say:</p><ul><li><p>Which specific YAML fields support resolution and which don&#8217;t</p></li><li><p>That `spark_env_vars` passes values through without resolving them</p></li><li><p>That the old `spark.databricks.job.id` conf key doesn&#8217;t work on serverless</p></li><li><p>That `dbutils.notebook` doesn&#8217;t load in non-notebook task types</p></li></ul><p>Each hallucination mapped to a gap in the documentation. The agent wasn&#8217;t generating random nonsense. 
It was generating reasonable-sounding answers to questions the docs leave unanswered. Incomplete docs don&#8217;t just confuse humans. They give LLMs just enough information to construct confident wrong answers.</p><h1>The human fix: stop guessing, start testing</h1><p>After four failed attempts, I did what I should have done first. I wrote a test script:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import sys

import os

print("=== sys.argv ===")

print(sys.argv)

print("\n=== Job-related env vars ===")

for key in sorted(os.environ):

    if "JOB" in key or "RUN" in key or "DATABRICKS" in key:

        print(f"  {key}={os.environ[key]}")
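
# Sketch, under assumptions: a Spark conf probe in the same style, so the
# "Spark conf threw hard errors" result can be reproduced next to the others.
# `probe_conf` is a hypothetical helper (not a Databricks API), and this
# assumes SparkSession.builder.getOrCreate() hands back the task's session.
print("\n=== Spark conf ===")


def probe_conf(get, keys):
    # Call get(key) for each key and record the value or the error text.
    results = {}
    for key in keys:
        try:
            results[key] = str(get(key))
        except Exception as e:
            results[key] = f"failed: {e}"
    return results


try:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    conf_results = probe_conf(
        spark.conf.get,
        ["spark.databricks.job.id", "spark.databricks.job.runId"],
    )
except Exception as e:
    # No pyspark, or no session available: report it instead of crashing.
    conf_results = {"spark session": f"failed: {e}"}

for key, value in conf_results.items():
    print(f"  {key}={value}")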

print("\n=== dbutils context ===")

try:

    from dbruntime.databricks_repl_context import get_context

    ctx = get_context()

    print(f"  jobId={ctx.jobId}")

    print(f"  idInJob={ctx.idInJob}")

except Exception as e:

    print(f"  repl_context failed: {e}")</code></pre></div><p>Just a few lines. I created a Databricks job, added job parameters with `{{job.id}}` and `{{job.run_id}}`, set the task parameters to pass them as CLI args, and ran it.</p><p>The output told me everything in one shot:</p><ul><li><p>`sys.argv` had the resolved job and run IDs from the task parameters</p></li><li><p>Every env var approach was dead</p></li><li><p>Spark conf threw hard errors</p></li><li><p>`dbruntime.databricks_repl_context` actually worked too (undocumented but functional)</p></li></ul><p>Ten minutes from &#8220;let me just test this&#8221; to knowing exactly which approaches work and which don&#8217;t. Compare that to four rounds of agent suggestions, deployments, and failures.</p><h1>The working solution</h1><p>Job-level parameters with dynamic value references, referenced from task `named_parameters`. The values arrive in `sys.argv` as `--name=value` arguments and get parsed with argparse:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml"># DAB job definition

parameters:

  - name: job_id

    default: "{{job.id}}"

  - name: run_id

    default: "{{job.run_id}}"

tasks:

  - python_wheel_task:

      entry_point: "my-workflow"

      named_parameters:

        job_id: "{{job.parameters.job_id}}"

        run_id: "{{job.parameters.run_id}}"</code></pre></div><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">@staticmethod

def _parse_job_context():

    import argparse, sys

    parser = argparse.ArgumentParser()

    parser.add_argument("--job_id", type=int, default=None)

    parser.add_argument("--run_id", type=int, default=None)

    args, _ = parser.parse_known_args(sys.argv[1:])

    return args.job_id, args.run_id</code></pre></div><p>We put the parsing in the base `Workflow` class. Any workflow that enables metrics gets job context automatically. The only per-workflow work is adding the `parameters` and `named_parameters` blocks to the DAB YAML.</p><h1>Not an anti-AI post</h1><p>I&#8217;m not writing this to dunk on LLMs. I use one every day. It wrote most of the boilerplate in our metrics writer. It&#8217;s genuinely good at generating code when the problem is well-understood and the patterns are common.</p><p>But there&#8217;s a specific failure mode that showed up four times in one afternoon: the agent treats documentation gaps as opportunities to interpolate. When the docs don&#8217;t say how to do something, it constructs an answer from adjacent knowledge. `spark.databricks.job.id` exists in older Databricks content, so it suggests that. `DATABRICKS_` is a common prefix for their env vars, so it invents one. The `dbutils.notebook.entry_point` chain works in notebooks, so it assumes it works everywhere.</p><p>Each interpolation sounds plausible. Each fails for a reason the agent can&#8217;t know without testing.</p><p>The fix wasn&#8217;t more prompting or a better model. It was stepping back and writing a test script. Isolating the problem. Running it. Reading the output. Deciding based on evidence instead of confidence.</p><p>That&#8217;s not a prompting skill. That&#8217;s an engineering skill. The specific one where you stop asking &#8220;what should work?&#8221; and start asking &#8220;what actually works right now, on this compute, in this runtime?&#8221;</p><p>LLMs write code. Engineers figure out which code to write. 
Those are different skills, and this afternoon was a good reminder that the second one isn&#8217;t going anywhere.</p>]]></content:encoded></item><item><title><![CDATA[One Cluster per Task — Proven, Ready, and Waiting]]></title><description><![CDATA[Part 3 of 3: Databricks Streaming Architecture]]></description><link>https://newsletter.kirankbs.com/p/one-cluster-per-task-proven-ready</link><guid isPermaLink="false">https://newsletter.kirankbs.com/p/one-cluster-per-task-proven-ready</guid><dc:creator><![CDATA[kiran kumar]]></dc:creator><pubDate>Wed, 11 Mar 2026 13:17:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!emMU!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c275ca9-e012-4196-ae89-0bd4ebe362aa_800x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>By the end of <a href="https://newsletter.kirankbs.com/p/streaming-failure-models-why-it-didnt">Part 1</a> &amp; <a href="https://newsletter.kirankbs.com/p/multi-task-on-a-shared-cluster-why">Part 2</a>, we knew what the real answer was. We just hadn&#8217;t committed to it yet.</p><p>Not because it wouldn&#8217;t work. We tested it. 
We documented it. The code was ready. The answer was one cluster per task &#8212; true driver isolation, one JVM per pipeline, failures that cannot, by construction, spread sideways.</p><p>The reason we hadn&#8217;t switched: budget and scale. That&#8217;s a completely legitimate engineering decision.</p><h2><strong>What &#8220;real&#8221; isolation actually means</strong></h2><p>Everything in Parts 1 and 2 lived in this shape:</p><p><code>Shared Job Cluster<br>  Python Proc A    Python Proc B<br>       \               /<br>        &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;<br>        &#9474;  Spark JVM  &#9474;<br>        &#9474;  (shared)   &#9474;<br>        &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;<br>               &#9474;<br>          Executors<br></code></p><p>One driver. Everything that fails, fails together &#8212; or worse, fails silently while the rest keeps running.</p><p>One cluster per task looks like this:</p><p><code>Dedicated Clusters<br>  Task A              Task B<br>  Cluster A           Cluster B<br>  Python Proc         Python Proc<br>       &#9474;                   &#9474;<br>  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;         &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;<br>  &#9474;  JVM A  &#9474;         &#9474;  JVM B  &#9474;<br>  &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;         &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;<br>       &#9474;                   &#9474;<br>  Executors           Executors</code></p><p><code><br>Task B fails &#8594; Cluster B restarts &#8594; Task A: unaffected<br></code></p><p>That last line is the one that matters. Task B crashing cannot reach Task A. No shared driver, no shared memory, no shared scheduler. 
The failure boundary is the cluster, not the job.</p><h2><strong>What the experiment showed</strong></h2><p>We ran it. Moved our streaming queries onto separate clusters &#8212; some as separate tasks in the same job, some as fully independent jobs. Both approaches worked.</p><p>Failures stopped spreading. When we deliberately triggered one, the other pipeline kept running correctly &#8212; not &#8220;mostly healthy.&#8221; The UI was accurate for the first time, which sounds like a minor thing until you remember how long we&#8217;d been staring at misleading dashboards.</p><p>Debugging got faster in a way that surprised me. Each cluster has its own driver logs, its own metrics, its own lifecycle. &#8220;Which pipeline failed?&#8221; stopped requiring timestamp correlation across a shared log stream.</p><p>Sizing became honest too. A heavy pipeline gets the compute it needs. A lightweight one runs on something smaller. You&#8217;re not picking a number that has to cover the worst case across everything.</p><h2><strong>The code was already ready</strong></h2><p>Something we hadn&#8217;t fully appreciated: we didn&#8217;t need to rewrite anything.</p><p>Our pipelines were structured so the same code could run in any deployment shape:</p><ul><li><p>All queries in one task, one cluster &#8212; the Part 1 setup</p></li><li><p>Multiple tasks, shared cluster &#8212; the Part 2 setup</p></li><li><p>Multiple tasks, dedicated cluster per task &#8212; the target architecture</p></li></ul><p>Deployment and configuration change, not a code change. The streaming logic, the Delta writes, the checkpointing &#8212; all identical. Only the compute allocation changed.</p><p>That was intentional. 
It also made the experiment cheap to run, which helped.</p><h2><strong>Why we&#8217;re waiting</strong></h2><p>Cost and scale, honestly.</p><p>Right now, our workload fits on a shared cluster without catastrophic failure &#8212; especially with <code>awaitAnyTermination</code> providing fail-fast semantics on the streaming side and task retries on the batch side. It&#8217;s not elegant. But it&#8217;s stable enough for where we are.</p><p>Once device count crosses the threshold, the calculus changes. When the operational cost of a missed or false alert exceeds the infrastructure cost of a dedicated cluster, you stop debating. We know roughly where that line is. We&#8217;re watching the numbers.</p><h2><strong>When to make the move</strong></h2><p>If any of these apply to your setup, you&#8217;re probably ready:</p><ul><li><p>A failure in one pipeline has caused a correctness problem in another</p></li><li><p>You&#8217;re spending more time debugging shared-cluster incidents than the extra cluster would cost per month</p></li><li><p>Customer impact from partial failures is measurable &#8212; missed alerts, false positives, SLA breaches</p></li><li><p>Individual pipelines have meaningfully different compute profiles and you&#8217;re sizing the shared cluster to the worst case</p></li></ul><p>None of these require a crisis. 
The best time to make the architecture change is before the next incident, not after.</p><h2><strong>The progression in one diagram</strong></h2><p><code>Part 1: Multi-query, shared cluster<br>&#8594; Problem: silent query failures<br>&#8594; Fix: awaitAnyTermination (band-aid)</code></p><p><code><br>Part 2: Multi-task, shared cluster<br>&#8594; Problem: still one driver, same failure modes<br>&#8594; Fix: task-level retry (better, still shared)</code></p><p><code><br>Part 3: One cluster per task<br>&#8594; True isolation<br>&#8594; Failures contained by construction<br>&#8594; Ready when the scale justifies it<br></code></p><p>The band-aids in Parts 1 and 2 weren&#8217;t mistakes. We took a known shortcut because the right answer cost more than the problem did at the time. Knowing that, and knowing what would change it &#8212; that&#8217;s what matters. We know both.</p><p>The code is sitting there, ready to go. We flip the switch when the numbers say it&#8217;s time.</p><div><hr></div><p><em>The incident that kicked off this series also involved a </em><code>ConcurrentAppendException</code><em> that went deeper than a simple retry could fix &#8212; Delta isolation levels, </em><code>isBlindAppend</code><em>, and why you can&#8217;t always do what you think you can inside </em><code>foreachBatch</code><em>. That&#8217;s the next post, and probably the most technically dense thing I&#8217;ll write this year.</em></p>]]></content:encoded></item><item><title><![CDATA[Multi-Task on a Shared Cluster — Why That's Also Not Enough]]></title><description><![CDATA[Part 2 of 3 &#8212; Databricks Streaming Architecture]]></description><link>https://newsletter.kirankbs.com/p/multi-task-on-a-shared-cluster-why</link><guid isPermaLink="false">https://newsletter.kirankbs.com/p/multi-task-on-a-shared-cluster-why</guid><dc:creator><![CDATA[kiran kumar]]></dc:creator><pubDate>Thu, 05 Mar 2026 21:14:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!emMU!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c275ca9-e012-4196-ae89-0bd4ebe362aa_800x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The instinct after Part 1 was obvious.</p><p>If running eight queries in one task means one failure can hide while others keep running &#8212; split them into multiple tasks. Separate concerns. Give each component its own retry boundary.</p>
<p>Right instinct. Wrong infrastructure assumption.</p><h2><strong>We tried it</strong></h2><p>While the multi-query incident from Part 1 was still fresh, we were already experimenting with a multi-task approach on a separate workflow. Two tasks, same shared job cluster:</p><ul><li><p><strong>Task 1</strong>: feature extraction &#8212; processing sensor data into feature tables</p></li><li><p><strong>Task 2</strong>: inference &#8212; ML model outputs written to downstream Delta tables</p></li></ul><p>Sequential dependency. Task 2 reads what Task 1 writes. Clean separation on paper.</p><p>Then Task 2 hit a wall.</p><h2><strong>The incident &#8212; external location mismatch</strong></h2><p>Task 2 was writing to a Delta table registered in Unity Catalog. The catalog entry pointed to external location A. The actual data sat at location B.</p><p>A misconfiguration. Easy to make during migration, hard to spot before it fails in production.</p><p>Task 2 failed. Task 1 kept running.</p><p>And here&#8217;s where it felt familiar: the job didn&#8217;t fail. No restart triggered. One task retrying. The other healthy. The UI said RUNNING.</p><p>Same story as Part 1. Different packaging.</p><h2><strong>The detail that changes everything: there&#8217;s still one driver</strong></h2><p>Here&#8217;s what multi-task on a shared cluster actually looks like at runtime:</p><pre><code><code>Multi-Task on a Shared Job Cluster

Task 1 (Python Process A)     Task 2 (Python Process B)
          \                           /
           \                         /
            &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
            &#9474;      Spark Driver      &#9474;
            &#9474;         JVM            &#9474;
            &#9474;    (shared by all)     &#9474;
            &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
                       &#9474;
                  Executors</code></code></pre><p>Multiple Python processes. One Spark driver JVM.</p><p>Compare that to multi-query single task from Part 1:</p><pre><code><code>Multi-Query Single Task

     Python Process (single)
              &#9474;
     &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
     &#9474;    Spark Driver    &#9474;
     &#9474;        JVM         &#9474;
     &#9474;  Q1  Q2  Q3 ... Q8 &#9474;
     &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
              &#9474;
         Executors</code></code></pre><p>The difference between these two diagrams is smaller than it looks. Both share the same driver. Both share the same executors. Multi-task adds Python process separation &#8212; but that&#8217;s not where streaming failures originate. Streaming failures live in the JVM, in the query scheduler, in the Delta transaction layer. All of which is still shared.</p><h2><strong>What multi-task actually adds on a shared cluster</strong></h2><p>Splitting into tasks on a shared cluster gives you:</p><ul><li><p>Multiple Python processes on the same driver node</p></li><li><p>Multiple SparkSession lifecycles, each with its own initialisation overhead</p></li><li><p>More listeners, more logging, more scheduler registration</p></li><li><p>Concurrent memory pressure when tasks run in parallel</p></li><li><p>A failing task retrying repeatedly can destabilise the cluster for every other task</p></li></ul><p>You get the operational complexity of multiple processes without the isolation you were looking for.</p><h2><strong>The fix &#8212; and why it works here</strong></h2><p>For the external location incident, we added task-level retry configuration: three retries per task on the continuous job. Once exhausted, Databricks restarts the entire job.</p><p>It works. And it&#8217;s a better failure story than Part 1 &#8212; Task 2 eventually fails loudly and triggers a restart rather than running silently while Task 1 keeps writing data nobody will ever resolve.</p><p>But here&#8217;s the key distinction: <strong>it works because Task 1 and Task 2 are sequential.</strong> Task 2 depends on Task 1. They don&#8217;t run simultaneously. No concurrent driver contention. Failure propagates cleanly up the chain.</p><p>Multi-task on a shared cluster is a reasonable pattern for sequential batch ETL. Feature extraction feeds inference. Inference feeds output. 
Tasks chain, failures surface, retries make sense.</p><p>The problem is assuming the same pattern works for parallel long-running streaming. That&#8217;s where the shared driver becomes a liability instead of a trade-off.</p><h2><strong>The rule we wrote down</strong></h2><p>After both incidents, this became our working principle:</p><blockquote><p><em><strong>Multi-task on a shared cluster: right for sequential batch ETL, wrong for parallel streaming.</strong></em></p></blockquote><p>The difference is contention. Sequential tasks don&#8217;t compete for the driver simultaneously. Parallel streaming queries do &#8212; continuously, for the lifetime of the job.</p><p>If you&#8217;re running parallel streaming on a shared cluster, a multi-query single task with <code>awaitAnyTermination</code> (Part 1) gives you a cleaner failure boundary than splitting into tasks.</p><p>If you&#8217;re running sequential batch ETL, multi-task with task-level retry is a legitimate approach within the budget constraints of a shared cluster.</p><h2><strong>But this still isn&#8217;t the real answer</strong></h2><p>Both fixes share the same problem.</p><p><code>awaitAnyTermination</code> in Part 1 makes query failures loud. Task retry in Part 2 makes task failures recoverable. Neither prevents a failure in one component from affecting the shared driver &#8212; and everything attached to it.</p><p>The real answer is what we&#8217;d resisted for months: one cluster per task. 
A failure in the inference pipeline that cannot, by construction, affect the ingestion pipeline.</p><p>That&#8217;s Part 3 &#8212; when we made the architectural change, what it cost, and what got better overnight.</p><p><strong>&#8594; Part 3: One Cluster per Task &#8212; What Real Isolation Actually Looks Like</strong></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.kirankbs.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading UnleashDataBytes! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Streaming Failure Models: Why "It Didn't Crash" Is the Worst Outcome]]></title><description><![CDATA[Part 1 of 3 &#8212; Databricks Streaming Architecture(Multi-Query Single Job on a Shared Cluster)]]></description><link>https://newsletter.kirankbs.com/p/streaming-failure-models-why-it-didnt</link><guid isPermaLink="false">https://newsletter.kirankbs.com/p/streaming-failure-models-why-it-didnt</guid><dc:creator><![CDATA[kiran kumar]]></dc:creator><pubDate>Mon, 02 Mar 2026 19:43:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!emMU!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c275ca9-e012-4196-ae89-0bd4ebe362aa_800x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It was 10:47pm. 
Laptop half-closed.</p><p>Then the message landed: <em>&#8220;Customer is complaining. Alerts are wrong.&#8221;</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.kirankbs.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading UnleashDataBytes! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>I opened the Databricks UI. Seven of eight queries: green. One: failed silently.</p><p>The job said <strong>RUNNING</strong>.</p><p>The customer said something different.</p><p>That gap &#8212; between what the system reports and what is actually happening &#8212; is what this series is about. Not the catastrophic failures. The quiet ones.</p><h2><strong>The incident</strong></h2><p>We were running a multi-query continuous job on a single shared cluster. Eight streaming queries, one Python process, one Spark driver.</p><p>Yes, you could call this a bad design. And you wouldn&#8217;t be wrong. But it was a conscious decision &#8212; budget-constrained infrastructure, deliberately consolidated. We knew the trade-offs. 
Or thought we did.</p><ul><li><p><strong>Query 1</strong>: bronze ingestion &#8212; reading raw telemetry from EventHub into Delta tables</p></li><li><p><strong>Queries 2&#8211;7</strong>: structuring raw data into individual domain tables</p></li><li><p><strong>Query 8</strong>: alert resolution &#8212; resolving alerts in near real-time that were created by a separate ML job</p></li></ul><p>All of them running together. All of them sharing the same driver.</p><div><hr></div><p>Bronze ingestion was completely fine. But Query 8 &#8212; alert resolution &#8212; hit a <code>ConcurrentAppendException</code>. A Delta Lake conflict that fires when two writers try to commit to the same table at the same time.</p><p>What followed was worse than a crash:</p><ul><li><p>Queries 1&#8211;7 kept running</p></li><li><p>Query 8 silently died</p></li><li><p>For <strong>12 minutes</strong>, offsets stopped committing for that stream while the UI still said RUNNING</p></li><li><p>No monitoring fired</p></li></ul><p>Then came the customer messages.</p><h2><strong>Why 12 minutes matters more than it sounds</strong></h2><p>Twelve minutes of stopped offset commits is not just a monitoring gap.</p><p>It means the system was consuming from EventHub without acknowledging it. If the job had restarted in that window: reprocessing, potential duplicates, depending on your downstream semantics.</p><p>But the job <em>didn&#8217;t</em> restart. Nothing triggered a restart. The driver was alive. The other seven queries were healthy. Spark had no reason to terminate the run.</p><p>The system was wrong for 12 minutes and had no idea.</p><h2><strong>The monitoring irony</strong></h2><p>We had monitoring. Job failure alerts, cluster termination alerts, retry notifications.</p><p>None of it fired.</p><p>Our monitoring was designed to catch <em>job</em> failures. What happened here wasn&#8217;t a job failure &#8212; it was a <em>streaming query</em> failure inside a job that kept running. 
The job never restarted. Our alerts never triggered.</p><p>The part that still stings: the query that died silently was the one responsible for <em>resolving</em> alerts.</p><p>Here&#8217;s what happened end to end: an ML signal created a CREATE alert. A downstream event should have triggered Query 8 to write a RESOLVE and close it. Query 8 was dead. The RESOLVE never came. The alert stayed open. The customer saw a false positive that should have been resolved hours earlier.</p><p>The system designed to catch problems couldn&#8217;t catch its own.</p><h2><strong>Two classes of failure &#8212; the mental model that finally clicked</strong></h2><p>After this incident, a pattern became clear. There are two fundamentally different ways a streaming job can fail.</p><p><strong>Query-scoped failures (driver stays alive)</strong></p><p><code>ConcurrentAppendException</code>. Schema mismatch. Logic errors in <code>foreachBatch</code>. Wrong Delta path.</p><p>In these cases: one stream fails, the Spark driver JVM stays up, the other queries keep running, and the UI looks mostly healthy. This is the dangerous state &#8212; half a pipeline, half a truth.</p><p><strong>Driver-scoped failures (everything stops)</strong></p><p>Driver OOM. JVM crash. Deep scheduler corruption.</p><p>Loud. Painful. But honest. Everything stops, Databricks restarts, you get a clean slate.</p><p>The uncomfortable reality: <strong>driver-scoped failures are easier to operate.</strong></p><p>A full crash is an operational problem. A partial failure is a product problem. The customer doesn&#8217;t care that Queries 1&#8211;7 were fine if Query 8 was quietly dead.</p><h2><strong>Why we didn&#8217;t catch it sooner &#8212; the awaitTermination trap</strong></h2><p>Here&#8217;s something Databricks doesn&#8217;t shout from the rooftops: when a streaming query fails inside a multi-query job, the Spark JVM doesn&#8217;t automatically know &#8212; or care &#8212; that the query is gone. 
The other queries keep running. The driver stays up. The job keeps running.</p><p>Our job had eight calls to <code>query.awaitTermination()</code>, one per stream, in sequence. The Python process blocks on the first one. When Query 8 failed, Python was still waiting on an earlier stream&#8217;s termination. It never reached the point of checking Query 8.</p><p>From Python&#8217;s perspective: blocked, waiting. From the JVM&#8217;s perspective: everything is fine. From the UI&#8217;s perspective: RUNNING. From the customer&#8217;s perspective: something is very wrong.</p><p>Worth noting: Databricks documentation doesn&#8217;t recommend either <code>awaitTermination</code> or <code>awaitAnyTermination</code> for production &#8212; they want you to rely on job-level management instead. But in our case, we needed the Python process to exit to trigger a JVM failure, which Databricks would then restart.</p><h2><strong>&#8221;But continuous jobs restart automatically&#8221;</strong></h2><p>They do &#8212; eventually. Continuous jobs retry with backoff and trigger a new run after exhausting retries.</p><p>The catch: for that to happen, the <strong>Spark JVM has to actually fail first</strong>. With <code>query.awaitTermination()</code> on each stream individually, the Python process never exited. Spark never failed. Databricks never restarted.</p><p>The job just sat there. RUNNING. Wrong.</p><h2><strong>The fix &#8212; and why it&#8217;s a band-aid</strong></h2><p>Once we understood the failure chain, the fix was one line.</p><p>Replace individual <code>query.awaitTermination()</code> calls with:</p><pre><code><code>spark.streams.awaitAnyTermination()</code></code></pre><p>Now, if <em>any</em> stream fails, the Python process exits. Python exits &#8594; Spark JVM dies &#8594; Databricks detects the failure &#8594; 3 retries &#8594; new run. Proper failure semantics, clean restart, no more silent death.</p><p>Did it work? Yes. Is it a proper solution? 
No.</p><p>But who has time for a proper solution when production is on fire?</p><p><code>awaitAnyTermination</code> gave us fail-fast behaviour while we worked on the actual fix. It stopped the lying. It didn&#8217;t fix the underlying architecture.</p><p>The underlying architecture &#8212; why running eight streaming queries in one job on a shared cluster creates these failure modes in the first place, and what we did about it &#8212; is what Part 2 is about.</p><p><strong>&#8594; Part 2: Multi-Task on a Shared Cluster &#8212; Why That&#8217;s Also Not Enough</strong></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.kirankbs.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading UnleashDataBytes! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Understanding Delta Table Partition Size Distribution Using the Delta Log]]></title><description><![CDATA[Databricks/Delta Partitioning Strategies]]></description><link>https://newsletter.kirankbs.com/p/understanding-delta-table-partition</link><guid isPermaLink="false">https://newsletter.kirankbs.com/p/understanding-delta-table-partition</guid><dc:creator><![CDATA[kiran kumar]]></dc:creator><pubDate>Mon, 16 Feb 2026 11:08:40 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!jXIV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f790e3d-f2fa-465f-937c-340cd1f9ff38_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jXIV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f790e3d-f2fa-465f-937c-340cd1f9ff38_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jXIV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f790e3d-f2fa-465f-937c-340cd1f9ff38_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!jXIV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f790e3d-f2fa-465f-937c-340cd1f9ff38_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!jXIV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f790e3d-f2fa-465f-937c-340cd1f9ff38_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!jXIV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f790e3d-f2fa-465f-937c-340cd1f9ff38_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jXIV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f790e3d-f2fa-465f-937c-340cd1f9ff38_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f790e3d-f2fa-465f-937c-340cd1f9ff38_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1208420,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.kirankbs.com/i/188125131?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f790e3d-f2fa-465f-937c-340cd1f9ff38_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jXIV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f790e3d-f2fa-465f-937c-340cd1f9ff38_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!jXIV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f790e3d-f2fa-465f-937c-340cd1f9ff38_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!jXIV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f790e3d-f2fa-465f-937c-340cd1f9ff38_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!jXIV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f790e3d-f2fa-465f-937c-340cd1f9ff38_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When working with <strong>externally managed Delta tables</strong> and traditional partitioning strategies (for example by <code>day</code>, <code>week</code>, or <code>month</code>), one common challenge is:</p><blockquote><p><em>How large are my partitions actually?</em></p></blockquote><p>Before deciding whether to partition by <strong>day vs. week vs. 
month</strong>, it&#8217;s important to understand how data is physically distributed across partitions.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.kirankbs.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading UnleashDataBytes! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>This article shows how to extract <strong>partition-level size statistics directly from the Delta transaction log</strong>.</p><blockquote><p>&#9888;&#65039; This approach is useful when using traditional partitioning and external table strategies.<br>If you&#8217;re using <strong>Unity Catalog managed tables</strong> or <strong>Liquid Clustering</strong>, partition sizing decisions are handled differently.</p></blockquote><h2>Why Look at the Delta Log?</h2><p>Every Delta table maintains a <code>_delta_log</code> directory containing JSON transaction files.<br>Each <code>add</code> action inside the log includes:</p><ul><li><p>File path</p></li><li><p>File size (in bytes)</p></li><li><p>Partition values</p></li></ul><p>By reading these logs, we can compute:</p><ul><li><p>Number of files per partition</p></li><li><p>Total bytes per partition</p></li><li><p>Size distribution across partitions</p></li></ul><p>This gives direct visibility into the <strong>physical layout</strong> of your table.</p><h2>Example: Compute Partition Sizes by <code>startDate</code></h2><pre><code><code>from pyspark.sql 
import functions as F

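# Read every JSON commit file in the table's transaction log.
# Note: Parquet checkpoint files are not read here, and commits already
# cleaned up by log retention are not visible.
delta_log = spark.read.json(
    "abfss://container@storage-account.dfs.core.windows.net/table_location/_delta_log/*.json"
)

# Keep only `add` actions: one row per data file written to the table.
# Caveat: files later dropped by a `remove` action (for example after
# OPTIMIZE or DELETE) are still counted, so the totals can overstate the
# table's current size.
files = (
    delta_log
    .filter("add is not null")
    .select(
        F.col("add.path").alias("path"),
        F.col("add.size").alias("size"),
        F.col("add.partitionValues.startDate").alias("startDate")
    )
)

# Aggregate file count and total bytes per partition, newest partition first.
(
    files.groupBy("startDate")
    .agg(
        F.count("*").alias("numFiles"),
        F.sum("size").alias("totalBytes")
    )
    .withColumn("sizeGB", F.col("totalBytes") / (1024**3))
    .orderBy("startDate", ascending=False)
    .show(20, False)
)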
</code></code></pre><h3>Sample Output</h3><pre><code><code>+----------+--------+-----------+--------+
|startDate |numFiles|totalBytes |sizeGB  |
+----------+--------+-----------+--------+
|2026-02-16|  xxx   |   xxx     |   xx   |
|2026-02-15|  xxx   |   xxx     |   xx   |
+----------+--------+-----------+--------+</code></code></pre><h2>How This Helps With Partition Strategy</h2><p>Once you have partition size metrics, you can evaluate:</p><h3>Are partitions too small?</h3><p>If daily partitions are only a few MB:</p><ul><li><p>You may be over-partitioning.</p></li><li><p>Consider partitioning by week or month instead.</p></li></ul><h3>Are partitions too large?</h3><p>If partitions exceed hundreds of GB:</p><ul><li><p>Queries may scan too much data.</p></li><li><p>Consider a finer-grained partitioning strategy.</p></li></ul><h3>Are files unevenly distributed?</h3><p>A high <code>numFiles</code> combined with a small average file size indicates a small-file problem.</p><h2>Decision Guidelines</h2><p>Partition size (per partition) and recommendation:</p><ul><li><p>&lt; 1 GB: Likely over-partitioned</p></li><li><p>1&#8211;20 GB: Usually healthy</p></li><li><p>50+ GB: Consider finer partitioning</p></li><li><p>100+ GB: May impact performance</p></li></ul><p><em>(Adjust based on workload and query patterns.)</em></p><h2>Bonus: Average File Size Per Partition</h2><p>You can extend the analysis by chaining one more column after the <code>.agg(...)</code> step of the aggregation above:</p><pre><code><code>.withColumn("avgFileSizeMB", (F.col("totalBytes") / F.col("numFiles")) / (1024**2))
</code></code></pre><p>This helps detect small file problems inside partitions.</p><h2>When Should You Use This?</h2><p>This approach is particularly useful when:</p><ul><li><p>Using <strong>external Delta tables</strong></p></li><li><p>Managing your own storage layout</p></li><li><p>Designing a new partition strategy</p></li><li><p>Migrating legacy Hive-style tables</p></li><li><p>Troubleshooting performance issues</p></li></ul><p>It gives a <strong>low-level, transparent view</strong> of how data is physically stored.</p><h2>Final Thoughts</h2><p>Partitioning decisions should be based on:</p><ul><li><p>Query access patterns</p></li><li><p>Partition cardinality</p></li><li><p>Physical partition size</p></li><li><p>File size distribution</p></li></ul><p>Reading the Delta transaction log provides a simple yet powerful way to understand your table layout &#8212; before committing to a partitioning strategy.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.kirankbs.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading UnleashDataBytes! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Simple Steps Towards Data Processing]]></title><description><![CDATA[Data Engineering Essentials]]></description><link>https://newsletter.kirankbs.com/p/simple-steps-towards-data-processing</link><guid isPermaLink="false">https://newsletter.kirankbs.com/p/simple-steps-towards-data-processing</guid><dc:creator><![CDATA[kiran kumar]]></dc:creator><pubDate>Thu, 28 Mar 2024 10:15:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VpID!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e272b45-6551-4ac1-a5c4-a198a634d18c_1072x767.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VpID!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e272b45-6551-4ac1-a5c4-a198a634d18c_1072x767.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VpID!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e272b45-6551-4ac1-a5c4-a198a634d18c_1072x767.png 424w, https://substackcdn.com/image/fetch/$s_!VpID!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e272b45-6551-4ac1-a5c4-a198a634d18c_1072x767.png 
848w, https://substackcdn.com/image/fetch/$s_!VpID!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e272b45-6551-4ac1-a5c4-a198a634d18c_1072x767.png 1272w, https://substackcdn.com/image/fetch/$s_!VpID!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e272b45-6551-4ac1-a5c4-a198a634d18c_1072x767.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VpID!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e272b45-6551-4ac1-a5c4-a198a634d18c_1072x767.png" width="1072" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e272b45-6551-4ac1-a5c4-a198a634d18c_1072x767.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1072,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:73963,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VpID!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e272b45-6551-4ac1-a5c4-a198a634d18c_1072x767.png 424w, https://substackcdn.com/image/fetch/$s_!VpID!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e272b45-6551-4ac1-a5c4-a198a634d18c_1072x767.png 848w, 
https://substackcdn.com/image/fetch/$s_!VpID!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e272b45-6551-4ac1-a5c4-a198a634d18c_1072x767.png 1272w, https://substackcdn.com/image/fetch/$s_!VpID!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e272b45-6551-4ac1-a5c4-a198a634d18c_1072x767.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Welcome, Aspiring Data Engineers, to the second installment of our Data Engineering Essentials series!</p><p>Our second post, <a 
href="https://newsletter.kirankbs.com/p/unlocking-data-engineering-simplifying">Unlocking Data Engineering: Simplifying the Journey for Beginners</a>, guided many through the process of reading from APIs and writing into Snowflake. It was like a walk in the park for many beginners, and writing into Snowflake felt like a breeze. If you haven't read it yet, I highly recommend doing so for some quick gratification!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.kirankbs.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading UnleashDataBytes! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Now, in our third newsletter post, "Simple Steps Towards Data Processing," I've ensured that the code remains straightforward, making it easy for you to understand and execute successfully.</p><p>If you ever find yourself stuck at any point, don't hesitate to reach out to me on LinkedIn at <a href="https://www.linkedin.com/in/kirankbs/">kirankbs</a>!</p><p>Let's embark on this beginner-friendly journey together, understand the process, and achieve another swift victory!</p><h2>Agenda: Basic Data Processing</h2><ul><li><p>Data Cleaning: Convert Dates &amp; Non numerical values</p></li><li><p>Feature Engineering: Create new features by deriving insights from existing columns</p></li><li><p>Data Exploration: Show Summary Statistics &amp; Outliers</p></li></ul><p>For Prerequisites &amp; 
Setting Up Python Project, please <a href="https://newsletter.kirankbs.com/i/142662720/prerequisites">refer to the previous article</a></p><p>Please note that this example utilizes MacOS and Python version 3.8.8. However, users on Linux and Windows systems should encounter no issues!</p><h3>Data Cleaning</h3><p>When extracting data from source systems, never assume that the data is of high quality or in the expected format. As a Data Engineer, you must convert it into the correct types for further processing and clean out any unwanted or corrupted records. Make sure the data is read with the proper schema.</p><pre><code># Define the Schema
schema = {
    "vendorid": "str",
    "lpep_pickup_datetime": "str",
    "lpep_dropoff_datetime": "str",
    "store_and_fwd_flag": "str",
    "ratecodeid": "str",
    "pickup_longitude": "float",
    "pickup_latitude": "float",
    "dropoff_longitude": "float",
    "dropoff_latitude": "float",
    "passenger_count": "int",
    "trip_distance": "float",
    "fare_amount": "float",
    "extra": "float",
    "mta_tax": "float",
    "tip_amount": "float",
    "tolls_amount": "float",
    "imp_surcharge": "float",
    "total_amount": "float",
    "payment_type": "str",
    "trip_type": "str"
}

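# Fetch data from API
import requests
import pandas as pd

api_url = "https://data.cityofnewyork.us/resource/hvrh-b6nb.json"
response = requests.get(api_url, timeout=30)
response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error body
data = response.json()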

# Convert data to a Pandas DataFrame, keeping only the columns named in the schema.
# The dtypes listed in `schema` are not applied here; columns are converted explicitly below.
df = pd.DataFrame(data, columns=list(schema))</code></pre><p>Convert the date columns from strings to a DateTime type for further processing. This gives you access to the datetime helper methods in Pandas that the later steps rely on.</p><pre><code># Data cleaning
# Convert the pickup and drop-off timestamps from strings to datetimes
df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'])
df['lpep_dropoff_datetime'] = pd.to_datetime(df['lpep_dropoff_datetime'])

# Coerce 'trip_distance' to a numeric type so it can be used in calculations
df['trip_distance'] = pd.to_numeric(df['trip_distance'], errors='coerce')  # Convert non-numeric values to NaN</code></pre><p>Well done! The data is now clean and prepared for further processing.</p><h3>Feature Engineering</h3><p>Generating new features from existing columns offers users deeper insights and makes visualizing aggregated information much clearer.</p><p>Let's derive additional insights from our existing columns. For instance, we can calculate metrics such as trip duration, pickup hour, pickup day of the week, and speed.</p><pre><code>df['trip_duration'] = (df['lpep_dropoff_datetime'] - df['lpep_pickup_datetime']).dt.total_seconds() / 60
df['pickup_hour'] = df['lpep_pickup_datetime'].dt.hour
df['pickup_dayofweek'] = df['lpep_pickup_datetime'].dt.dayofweek
# Speed as distance per minute; trips with a zero or negative duration get NaN instead of an infinite value
df['speed'] = df['trip_distance'] / df['trip_duration'].where(df['trip_duration'] &gt; 0)</code></pre><h3>Data Exploration</h3><p>We've finished cleaning the data and adding new features. Now, let's explore the data! The first 100 records of the dataset are printed to the console.</p><h4>View Data</h4><pre><code>Color.printGreen("Check the first few rows of the DataFrame:")
print(df.head(100))</code></pre><h4>View Missing Values</h4><pre><code>Color.printGreen("Check for missing values:")
print(df.isnull().sum())</code></pre><h4>View Summary for Trip Duration</h4><pre><code>trip_duration_stats = df['trip_duration'].describe()
Color.printGreen("Summary Statistics for Trip Duration:")
print(trip_duration_stats)</code></pre><h4>View Outliers</h4><pre><code>df['fare_amount'] = pd.to_numeric(df['fare_amount'], errors='coerce')
# Treat negative fares as outliers
outliers = df[df['fare_amount'] &lt; 0]
Color.printGreen("Outliers in fare_amount column:")
print(outliers)</code></pre><h3>Write To File</h3><p>Define the features you want to include in the new DataFrame:</p><pre><code>selected_features = ['vendorid', 'lpep_pickup_datetime', 'lpep_dropoff_datetime', 'passenger_count', 'trip_distance',
                     'fare_amount', 'total_amount', 'trip_duration', 'pickup_hour', 'pickup_dayofweek', 'speed']</code></pre><p>Select the specified features from the original DataFrame:</p><pre><code>new_df = df[selected_features].copy()</code></pre><p>Write the new DataFrame to a CSV file:</p><pre><code>new_df.to_csv('trips_insights.csv', index=False)</code></pre><p>Congratulations if you have made it this far!</p><p>That's it! It is this simple to fetch data from an API and process it. You don&#8217;t need Spark, Flink, or a notebook to write simple code.</p><p><strong>Note: </strong>This application is far from complete, but let&#8217;s take these basic steps first and improve it further!</p><h2>What Is Next</h2><p><strong>In the upcoming newsletter, I'll delve into captivating visualizations and dashboards without using PowerBI or Tableau.</strong></p><p>Please subscribe for more such content and share it to reach more readers!</p>
]]></content:encoded></item><item><title><![CDATA[Unlocking Data Engineering: Simplifying the Journey for Beginners]]></title><description><![CDATA[Data Engineering Essentials]]></description><link>https://newsletter.kirankbs.com/p/unlocking-data-engineering-simplifying</link><guid isPermaLink="false">https://newsletter.kirankbs.com/p/unlocking-data-engineering-simplifying</guid><dc:creator><![CDATA[kiran kumar]]></dc:creator><pubDate>Mon, 18 Mar 2024 08:14:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Z-2A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444c6385-3692-49c1-a9de-4b80b3b11572_1072x767.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z-2A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444c6385-3692-49c1-a9de-4b80b3b11572_1072x767.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z-2A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444c6385-3692-49c1-a9de-4b80b3b11572_1072x767.png 424w, 
https://substackcdn.com/image/fetch/$s_!Z-2A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444c6385-3692-49c1-a9de-4b80b3b11572_1072x767.png 848w, https://substackcdn.com/image/fetch/$s_!Z-2A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444c6385-3692-49c1-a9de-4b80b3b11572_1072x767.png 1272w, https://substackcdn.com/image/fetch/$s_!Z-2A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444c6385-3692-49c1-a9de-4b80b3b11572_1072x767.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z-2A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444c6385-3692-49c1-a9de-4b80b3b11572_1072x767.png" width="1072" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/444c6385-3692-49c1-a9de-4b80b3b11572_1072x767.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1072,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148021,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z-2A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444c6385-3692-49c1-a9de-4b80b3b11572_1072x767.png 424w, 
https://substackcdn.com/image/fetch/$s_!Z-2A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444c6385-3692-49c1-a9de-4b80b3b11572_1072x767.png 848w, https://substackcdn.com/image/fetch/$s_!Z-2A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444c6385-3692-49c1-a9de-4b80b3b11572_1072x767.png 1272w, https://substackcdn.com/image/fetch/$s_!Z-2A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444c6385-3692-49c1-a9de-4b80b3b11572_1072x767.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p></p><p>Navigating the vast landscape of Data Engineering tools and concepts can be overwhelming for newcomers, with a multitude of elements to consider: data sources, storage solutions, schemas, formats, pipelines, visualization tools, processing engines, and more.</p><p><strong>Aspiring Data Engineers</strong>, getting quick wins is crucial for maintaining momentum and motivation!</p><p>The inspiration for this newsletter post comes from a LinkedIn <a href="https://www.linkedin.com/feed/update/urn:li:activity:7167949764481761280/">post</a> by data expert <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Zach Wilson&quot;,&quot;id&quot;:10367987,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a857d08-ec8d-4a0e-9cb5-ad8434fe519e_2333x3500.jpeg&quot;,&quot;uuid&quot;:&quot;910476c0-b9f9-4dda-9041-0b5e2b5d9bb7&quot;}" data-component-name="MentionToDOM">Zach Wilson</span>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QaLv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae2a4653-56a5-448e-ba02-2e9a4d59cd1e_1130x1092.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QaLv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae2a4653-56a5-448e-ba02-2e9a4d59cd1e_1130x1092.png 424w, https://substackcdn.com/image/fetch/$s_!QaLv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae2a4653-56a5-448e-ba02-2e9a4d59cd1e_1130x1092.png 848w, 
https://substackcdn.com/image/fetch/$s_!QaLv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae2a4653-56a5-448e-ba02-2e9a4d59cd1e_1130x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!QaLv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae2a4653-56a5-448e-ba02-2e9a4d59cd1e_1130x1092.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QaLv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae2a4653-56a5-448e-ba02-2e9a4d59cd1e_1130x1092.png" width="1130" height="1092" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae2a4653-56a5-448e-ba02-2e9a4d59cd1e_1130x1092.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1130,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:190463,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QaLv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae2a4653-56a5-448e-ba02-2e9a4d59cd1e_1130x1092.png 424w, https://substackcdn.com/image/fetch/$s_!QaLv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae2a4653-56a5-448e-ba02-2e9a4d59cd1e_1130x1092.png 848w, 
https://substackcdn.com/image/fetch/$s_!QaLv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae2a4653-56a5-448e-ba02-2e9a4d59cd1e_1130x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!QaLv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae2a4653-56a5-448e-ba02-2e9a4d59cd1e_1130x1092.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>For this newsletter post, I deliberately began with a small dataset and relied solely on Python Pandas to get a quick win and a high-level understanding of end-to-end data processing.</p><p>If you get stuck at any point, feel free to ping me on LinkedIn <a href="https://www.linkedin.com/in/kirankbs/">kirankbs</a>!</p><p>Let's stroll through this beginner-friendly process to grasp what's going on, and I'll help you navigate through it to achieve a swift victory!</p><h2>Prerequisites</h2><ul><li><p><a href="https://www.python.org/downloads/">Python</a></p></li><li><p><a href="https://pip.pypa.io/en/stable/installation/">pip</a></p></li></ul><p>Please note that this example uses macOS and Python version 3.8.8. However, users on Linux and Windows systems should encounter no issues!</p><h2>Set Up Python Project</h2><p>Download or clone the source code from the <a href="https://github.com/UnleashDataBytes/UnleashNYCTaxiDataPython">GitHub Repository</a>. 
You can open the project in any IDE, such as IntelliJ/PyCharm or VSCode, or simply in a text editor.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UnKJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24199add-324c-41b0-9712-4d488241caf7_1904x988.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UnKJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24199add-324c-41b0-9712-4d488241caf7_1904x988.png 424w, https://substackcdn.com/image/fetch/$s_!UnKJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24199add-324c-41b0-9712-4d488241caf7_1904x988.png 848w, https://substackcdn.com/image/fetch/$s_!UnKJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24199add-324c-41b0-9712-4d488241caf7_1904x988.png 1272w, https://substackcdn.com/image/fetch/$s_!UnKJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24199add-324c-41b0-9712-4d488241caf7_1904x988.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UnKJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24199add-324c-41b0-9712-4d488241caf7_1904x988.png" width="1456" height="756" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/24199add-324c-41b0-9712-4d488241caf7_1904x988.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:756,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:240048,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UnKJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24199add-324c-41b0-9712-4d488241caf7_1904x988.png 424w, https://substackcdn.com/image/fetch/$s_!UnKJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24199add-324c-41b0-9712-4d488241caf7_1904x988.png 848w, https://substackcdn.com/image/fetch/$s_!UnKJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24199add-324c-41b0-9712-4d488241caf7_1904x988.png 1272w, https://substackcdn.com/image/fetch/$s_!UnKJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24199add-324c-41b0-9712-4d488241caf7_1904x988.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Let&#8217;s create a Python virtual environment to install project dependencies locally. 
Run the commands below inside the project folder:</p><pre><code>python -m venv venv</code></pre><p>Activate the Python virtual environment (on Windows, run venv\Scripts\activate instead):</p><pre><code>source venv/bin/activate</code></pre><p>Install the dependencies:</p><pre><code>pip install -r requirements.txt</code></pre><p>Now the project is downloaded and its dependencies are installed!</p><p>However, you'll still need to set up a Snowflake account and configure the project before running the application.</p><h2>Snowflake Set Up</h2><p>Snowflake Cloud offers a 30-day free trial, providing ample time to explore and gain practical experience. Once you've signed up and created an account, update the following code block in the file <a href="https://github.com/UnleashDataBytes/UnleashNYCTaxiDataPython/blob/93822e786b7938528b544e5d1aa88cc658334f9e/src/nyc-taxi-data-e2e/WriteToSnowflake.py#L43">WriteToSnowflake.py</a>:</p><pre><code><code>conn_params = {
    'account': 'account name',  # Replace with your account name
    'user': 'user name',  # Replace with your username
    'password': 'password',  # Replace with your password
    'warehouse': 'NYCTAXI',
    'database': 'NYCTAXIDATABASE',
    'schema': 'public'
}</code></code></pre><p>with the following details:</p><p><strong>Account</strong>: This serves as the unique identifier for your Snowflake account. You typically receive it upon signing up for Snowflake. You can also find it in the URL you use to access the Snowflake web interface. For instance, if the URL is <a href="https://app.snowflake.com/lzkqavc/dq12000">https://app.snowflake.com/lzkqavc/dq12000</a>, the account name would be lzkqavc-dq12000. Alternatively, you can locate it under Admin &#8594; Accounts, resembling https://&lt;account&gt;.snowflakecomputing.com.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZUhD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cde18bb-cb6b-40b8-8679-7055e2a31f18_3568x1488.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZUhD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cde18bb-cb6b-40b8-8679-7055e2a31f18_3568x1488.png 424w, https://substackcdn.com/image/fetch/$s_!ZUhD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cde18bb-cb6b-40b8-8679-7055e2a31f18_3568x1488.png 848w, https://substackcdn.com/image/fetch/$s_!ZUhD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cde18bb-cb6b-40b8-8679-7055e2a31f18_3568x1488.png 1272w, https://substackcdn.com/image/fetch/$s_!ZUhD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cde18bb-cb6b-40b8-8679-7055e2a31f18_3568x1488.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ZUhD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cde18bb-cb6b-40b8-8679-7055e2a31f18_3568x1488.png" width="1456" height="607" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2cde18bb-cb6b-40b8-8679-7055e2a31f18_3568x1488.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:607,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316746,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZUhD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cde18bb-cb6b-40b8-8679-7055e2a31f18_3568x1488.png 424w, https://substackcdn.com/image/fetch/$s_!ZUhD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cde18bb-cb6b-40b8-8679-7055e2a31f18_3568x1488.png 848w, https://substackcdn.com/image/fetch/$s_!ZUhD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cde18bb-cb6b-40b8-8679-7055e2a31f18_3568x1488.png 1272w, https://substackcdn.com/image/fetch/$s_!ZUhD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cde18bb-cb6b-40b8-8679-7055e2a31f18_3568x1488.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>User</strong>: This is the username you use to log in to Snowflake.</p><p><strong>Password</strong>: This is the password associated with your Snowflake user account.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5RQn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F324d2e99-dd6d-4d83-9d4c-585ae8ad1fe1_996x1052.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!5RQn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F324d2e99-dd6d-4d83-9d4c-585ae8ad1fe1_996x1052.png 424w, https://substackcdn.com/image/fetch/$s_!5RQn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F324d2e99-dd6d-4d83-9d4c-585ae8ad1fe1_996x1052.png 848w, https://substackcdn.com/image/fetch/$s_!5RQn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F324d2e99-dd6d-4d83-9d4c-585ae8ad1fe1_996x1052.png 1272w, https://substackcdn.com/image/fetch/$s_!5RQn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F324d2e99-dd6d-4d83-9d4c-585ae8ad1fe1_996x1052.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5RQn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F324d2e99-dd6d-4d83-9d4c-585ae8ad1fe1_996x1052.png" width="516" height="545.0120481927711" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/324d2e99-dd6d-4d83-9d4c-585ae8ad1fe1_996x1052.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1052,&quot;width&quot;:996,&quot;resizeWidth&quot;:516,&quot;bytes&quot;:113016,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!5RQn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F324d2e99-dd6d-4d83-9d4c-585ae8ad1fe1_996x1052.png 424w, https://substackcdn.com/image/fetch/$s_!5RQn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F324d2e99-dd6d-4d83-9d4c-585ae8ad1fe1_996x1052.png 848w, https://substackcdn.com/image/fetch/$s_!5RQn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F324d2e99-dd6d-4d83-9d4c-585ae8ad1fe1_996x1052.png 1272w, https://substackcdn.com/image/fetch/$s_!5RQn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F324d2e99-dd6d-4d83-9d4c-585ae8ad1fe1_996x1052.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Warehouse</strong>: In Snowflake, a warehouse is a computing resource that executes SQL queries. Create a new warehouse <em>NYCTAXI</em> by navigating to the Admin -&gt; Warehouses tab.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IVQo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85433a-7321-4afa-bb37-9aa2c7129cef_3518x1482.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IVQo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85433a-7321-4afa-bb37-9aa2c7129cef_3518x1482.png 424w, https://substackcdn.com/image/fetch/$s_!IVQo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85433a-7321-4afa-bb37-9aa2c7129cef_3518x1482.png 848w, https://substackcdn.com/image/fetch/$s_!IVQo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85433a-7321-4afa-bb37-9aa2c7129cef_3518x1482.png 1272w, https://substackcdn.com/image/fetch/$s_!IVQo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85433a-7321-4afa-bb37-9aa2c7129cef_3518x1482.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!IVQo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85433a-7321-4afa-bb37-9aa2c7129cef_3518x1482.png" width="1456" height="613" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d85433a-7321-4afa-bb37-9aa2c7129cef_3518x1482.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:613,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:274520,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IVQo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85433a-7321-4afa-bb37-9aa2c7129cef_3518x1482.png 424w, https://substackcdn.com/image/fetch/$s_!IVQo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85433a-7321-4afa-bb37-9aa2c7129cef_3518x1482.png 848w, https://substackcdn.com/image/fetch/$s_!IVQo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85433a-7321-4afa-bb37-9aa2c7129cef_3518x1482.png 1272w, https://substackcdn.com/image/fetch/$s_!IVQo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85433a-7321-4afa-bb37-9aa2c7129cef_3518x1482.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Database</strong>: A database in Snowflake is a container for your data and database objects. 
Create a new database <em>NYCTAXIDATABASE</em> by logging in to Snowflake and navigating to the Data &#8594; Databases tab.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CE8H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e62ce6-de22-4239-808c-7644ce173cd3_3542x1098.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CE8H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e62ce6-de22-4239-808c-7644ce173cd3_3542x1098.png 424w, https://substackcdn.com/image/fetch/$s_!CE8H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e62ce6-de22-4239-808c-7644ce173cd3_3542x1098.png 848w, https://substackcdn.com/image/fetch/$s_!CE8H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e62ce6-de22-4239-808c-7644ce173cd3_3542x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!CE8H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e62ce6-de22-4239-808c-7644ce173cd3_3542x1098.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CE8H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e62ce6-de22-4239-808c-7644ce173cd3_3542x1098.png" width="1456" height="451" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80e62ce6-de22-4239-808c-7644ce173cd3_3542x1098.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:451,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:239788,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CE8H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e62ce6-de22-4239-808c-7644ce173cd3_3542x1098.png 424w, https://substackcdn.com/image/fetch/$s_!CE8H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e62ce6-de22-4239-808c-7644ce173cd3_3542x1098.png 848w, https://substackcdn.com/image/fetch/$s_!CE8H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e62ce6-de22-4239-808c-7644ce173cd3_3542x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!CE8H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e62ce6-de22-4239-808c-7644ce173cd3_3542x1098.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Schema</strong>: A schema in Snowflake is a logical container for database objects, such as tables, views, and functions. You can find the available schemas by logging in to Snowflake and selecting the desired database. 
<em>public</em> schema is the default schema in Snowflake.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c76b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bfaf95-ed15-458b-9b73-69f084c8ba8a_3574x1184.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c76b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bfaf95-ed15-458b-9b73-69f084c8ba8a_3574x1184.png 424w, https://substackcdn.com/image/fetch/$s_!c76b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bfaf95-ed15-458b-9b73-69f084c8ba8a_3574x1184.png 848w, https://substackcdn.com/image/fetch/$s_!c76b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bfaf95-ed15-458b-9b73-69f084c8ba8a_3574x1184.png 1272w, https://substackcdn.com/image/fetch/$s_!c76b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bfaf95-ed15-458b-9b73-69f084c8ba8a_3574x1184.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c76b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bfaf95-ed15-458b-9b73-69f084c8ba8a_3574x1184.png" width="1456" height="482" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84bfaf95-ed15-458b-9b73-69f084c8ba8a_3574x1184.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:482,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:265030,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c76b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bfaf95-ed15-458b-9b73-69f084c8ba8a_3574x1184.png 424w, https://substackcdn.com/image/fetch/$s_!c76b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bfaf95-ed15-458b-9b73-69f084c8ba8a_3574x1184.png 848w, https://substackcdn.com/image/fetch/$s_!c76b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bfaf95-ed15-458b-9b73-69f084c8ba8a_3574x1184.png 1272w, https://substackcdn.com/image/fetch/$s_!c76b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bfaf95-ed15-458b-9b73-69f084c8ba8a_3574x1184.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Create the <em>TRIPS</em> table inside the <em>public</em> schema with the following definition:</p><pre><code>create or replace TABLE NYCTAXIDATABASE.PUBLIC.TRIPS (
&#9;VENDORID VARCHAR(16777216),
&#9;LPEP_PICKUP_DATETIME VARCHAR(16777216),
&#9;LPEP_DROPOFF_DATETIME VARCHAR(16777216),
&#9;STORE_AND_FWD_FLAG VARCHAR(16777216),
&#9;RATECODEID VARCHAR(16777216),
&#9;PICKUP_LONGITUDE FLOAT, -- coordinates are fractional degrees
&#9;PICKUP_LATITUDE FLOAT,
&#9;DROPOFF_LONGITUDE FLOAT,
&#9;DROPOFF_LATITUDE FLOAT,
&#9;PASSENGER_COUNT NUMBER(38,0),
&#9;TRIP_DISTANCE NUMBER(10,2), -- scale 2 keeps decimals that NUMBER(38,0) would round away
&#9;FARE_AMOUNT NUMBER(10,2),
&#9;EXTRA NUMBER(10,2),
&#9;MTA_TAX NUMBER(10,2),
&#9;TIP_AMOUNT NUMBER(10,2),
&#9;TOLLS_AMOUNT NUMBER(10,2),
&#9;IMP_SURCHARGE NUMBER(10,2),
&#9;TOTAL_AMOUNT NUMBER(10,2),
&#9;PAYMENT_TYPE VARCHAR(16777216),
&#9;TRIP_TYPE VARCHAR(16777216)
);</code></pre><h2>Run the Application</h2><p>The code is fairly self-explanatory, so let&#8217;s run the following command from the project folder:</p><pre><code>python src/nyc-taxi-data-e2e/WriteToSnowflake.py</code></pre><p>You should see the following output in the console:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x3LP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae94989-4289-4ce9-89f6-1b9290879fa5_3466x344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x3LP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae94989-4289-4ce9-89f6-1b9290879fa5_3466x344.png 424w, https://substackcdn.com/image/fetch/$s_!x3LP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae94989-4289-4ce9-89f6-1b9290879fa5_3466x344.png 848w, https://substackcdn.com/image/fetch/$s_!x3LP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae94989-4289-4ce9-89f6-1b9290879fa5_3466x344.png 1272w, https://substackcdn.com/image/fetch/$s_!x3LP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae94989-4289-4ce9-89f6-1b9290879fa5_3466x344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x3LP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae94989-4289-4ce9-89f6-1b9290879fa5_3466x344.png" width="1456" height="145" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ae94989-4289-4ce9-89f6-1b9290879fa5_3466x344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:145,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:426916,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x3LP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae94989-4289-4ce9-89f6-1b9290879fa5_3466x344.png 424w, https://substackcdn.com/image/fetch/$s_!x3LP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae94989-4289-4ce9-89f6-1b9290879fa5_3466x344.png 848w, https://substackcdn.com/image/fetch/$s_!x3LP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae94989-4289-4ce9-89f6-1b9290879fa5_3466x344.png 1272w, https://substackcdn.com/image/fetch/$s_!x3LP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae94989-4289-4ce9-89f6-1b9290879fa5_3466x344.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Click the Refresh button at Data &#8594; Databases &#8594; NYCTAXIDATABASE &#8594; public &#8594; Tables &#8594; TRIPS &#8594; Data Preview. Tada! 
There you go with the Data!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZdOn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c803ecc-24da-4c26-a210-9b9e2edc81d9_3562x1464.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZdOn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c803ecc-24da-4c26-a210-9b9e2edc81d9_3562x1464.png 424w, https://substackcdn.com/image/fetch/$s_!ZdOn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c803ecc-24da-4c26-a210-9b9e2edc81d9_3562x1464.png 848w, https://substackcdn.com/image/fetch/$s_!ZdOn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c803ecc-24da-4c26-a210-9b9e2edc81d9_3562x1464.png 1272w, https://substackcdn.com/image/fetch/$s_!ZdOn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c803ecc-24da-4c26-a210-9b9e2edc81d9_3562x1464.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZdOn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c803ecc-24da-4c26-a210-9b9e2edc81d9_3562x1464.png" width="1456" height="598" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c803ecc-24da-4c26-a210-9b9e2edc81d9_3562x1464.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:598,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:501405,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZdOn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c803ecc-24da-4c26-a210-9b9e2edc81d9_3562x1464.png 424w, https://substackcdn.com/image/fetch/$s_!ZdOn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c803ecc-24da-4c26-a210-9b9e2edc81d9_3562x1464.png 848w, https://substackcdn.com/image/fetch/$s_!ZdOn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c803ecc-24da-4c26-a210-9b9e2edc81d9_3562x1464.png 1272w, https://substackcdn.com/image/fetch/$s_!ZdOn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c803ecc-24da-4c26-a210-9b9e2edc81d9_3562x1464.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Congratulations if you have made it this far!</p><p>That&#8217;s it! It is this simple to fetch data from an API and insert it into a Snowflake database. You don&#8217;t need Spark, Flink, or a notebook to write simple code like this.</p><p><strong>Note: </strong>This application is far from complete, but let&#8217;s start with these basic steps and improve it further!</p><p>Please subscribe for more content like this, and share it so it reaches more readers! 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://newsletter.kirankbs.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://newsletter.kirankbs.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[OAuth Redirect in a nutshell!]]></title><description><![CDATA[OAuth Redirect is important step in whole mechanism!]]></description><link>https://newsletter.kirankbs.com/p/oauth-redirect-in-a-nutshell</link><guid isPermaLink="false">https://newsletter.kirankbs.com/p/oauth-redirect-in-a-nutshell</guid><dc:creator><![CDATA[kiran kumar]]></dc:creator><pubDate>Thu, 07 Mar 2024 20:11:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bymo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71c38f8-6a18-44f1-b9d3-8c22a7cf1477_1016x1055.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bymo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71c38f8-6a18-44f1-b9d3-8c22a7cf1477_1016x1055.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bymo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71c38f8-6a18-44f1-b9d3-8c22a7cf1477_1016x1055.png 424w, https://substackcdn.com/image/fetch/$s_!bymo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71c38f8-6a18-44f1-b9d3-8c22a7cf1477_1016x1055.png 848w, 
https://substackcdn.com/image/fetch/$s_!bymo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71c38f8-6a18-44f1-b9d3-8c22a7cf1477_1016x1055.png 1272w, https://substackcdn.com/image/fetch/$s_!bymo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71c38f8-6a18-44f1-b9d3-8c22a7cf1477_1016x1055.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bymo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71c38f8-6a18-44f1-b9d3-8c22a7cf1477_1016x1055.png" width="1016" height="1055" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d71c38f8-6a18-44f1-b9d3-8c22a7cf1477_1016x1055.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1055,&quot;width&quot;:1016,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89698,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bymo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71c38f8-6a18-44f1-b9d3-8c22a7cf1477_1016x1055.png 424w, https://substackcdn.com/image/fetch/$s_!bymo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71c38f8-6a18-44f1-b9d3-8c22a7cf1477_1016x1055.png 848w, 
https://substackcdn.com/image/fetch/$s_!bymo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71c38f8-6a18-44f1-b9d3-8c22a7cf1477_1016x1055.png 1272w, https://substackcdn.com/image/fetch/$s_!bymo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71c38f8-6a18-44f1-b9d3-8c22a7cf1477_1016x1055.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The first time I encountered the OAuth mechanism, I was puzzled by a particular question: How does the OAuth provider reach 
the web app endpoint, such as a service running at localhost?</p><p>Here's the thing: the OAuth provider doesn't directly contact your local service. Instead, it redirects the user's browser back to your service with an authorization code.</p><p>But hold on a second... Doesn't redirection imply calling my service's endpoint?</p><p>The crucial point to remember is the fourth step listed below: the user's browser receives an HTTP 302 response with a `Location` header pointing to your callback URL.</p><p>If you're still feeling unsure, or if you're in the same boat as I was, let's dive deeper.</p><p>I'll explain using the GitHub OAuth provider as an example, with an animated flow:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J-ke!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa04dd5-cd8d-4215-b4be-d9015fa48201_1016x917.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!J-ke!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa04dd5-cd8d-4215-b4be-d9015fa48201_1016x917.gif 424w, https://substackcdn.com/image/fetch/$s_!J-ke!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa04dd5-cd8d-4215-b4be-d9015fa48201_1016x917.gif 848w, https://substackcdn.com/image/fetch/$s_!J-ke!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa04dd5-cd8d-4215-b4be-d9015fa48201_1016x917.gif 1272w, https://substackcdn.com/image/fetch/$s_!J-ke!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa04dd5-cd8d-4215-b4be-d9015fa48201_1016x917.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!J-ke!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa04dd5-cd8d-4215-b4be-d9015fa48201_1016x917.gif" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7aa04dd5-cd8d-4215-b4be-d9015fa48201_1016x917.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:17252692,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!J-ke!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa04dd5-cd8d-4215-b4be-d9015fa48201_1016x917.gif 424w, 
https://substackcdn.com/image/fetch/$s_!J-ke!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa04dd5-cd8d-4215-b4be-d9015fa48201_1016x917.gif 848w, https://substackcdn.com/image/fetch/$s_!J-ke!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa04dd5-cd8d-4215-b4be-d9015fa48201_1016x917.gif 1272w, https://substackcdn.com/image/fetch/$s_!J-ke!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa04dd5-cd8d-4215-b4be-d9015fa48201_1016x917.gif 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><p><strong>Callback URL</strong>: This is a URL in your application where GitHub's OAuth service will redirect the user after they have authorized your application. You provide this URL when you redirect the user to GitHub's OAuth service. In this example, it's localhost:8080/github/callback.</p><p><strong>Redirecting to GitHub's OAuth service</strong>: When your application wants to authenticate a user, it redirects them to GitHub's OAuth service. It does this by building a URL to GitHub's OAuth service that includes your application's client ID, the requested scopes, and the callback URL. This URL is generated by `oAuthConfig.AuthCodeURL(stateToken)` in your code.</p><p><strong>User authorizes your application</strong>: The user logs in to GitHub and is asked whether they want to give your application the permissions it's requesting (the scopes). If they agree, GitHub's OAuth service will redirect them back to your application using the callback URL you provided.</p><p><strong>Redirect back to your application</strong>: GitHub's OAuth service redirects the user's browser back to your application by sending an HTTP 302 response with a `Location` header set to your callback URL. It appends an authorization code as a query parameter to this URL. 
The user's browser follows this redirect, making a request to your callback URL with the authorization code in the query string.</p><p><strong>Your application exchanges the authorization code for an access token</strong>: Your application extracts the authorization code from the query string and makes a server-to-server request to GitHub's OAuth service to exchange the authorization code for an access token.</p><p><strong>Summary</strong></p><p>So, GitHub's OAuth service doesn't directly call your callback URL. Instead, it relies on the user's browser to make a request to your callback URL with the authorization code. This is why the callback URL must be a URL that the user's browser can reach.</p>]]></content:encoded></item><item><title><![CDATA[Welcome to UnleashDataBytes]]></title><description><![CDATA[Tech Newsletter]]></description><link>https://newsletter.kirankbs.com/p/coming-soon</link><guid isPermaLink="false">https://newsletter.kirankbs.com/p/coming-soon</guid><dc:creator><![CDATA[kiran kumar]]></dc:creator><pubDate>Sat, 02 Mar 2024 13:00:09 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/df9afe5f-7220-4e63-8271-1f181d8fc47e_800x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to our Substack newsletter! We explore complex concepts around data and Software Engineering, sharing insights from my daily experiences and personal projects.</p><p>What makes us unique? We break down complex topics Byte by Byte, empowering you to become a better data engineer. Our goal? Simplify the tech world for you!</p><p>Ready to dive in and level up your skills? Subscribe now and let's unleash your inner engineer!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://newsletter.kirankbs.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://newsletter.kirankbs.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item></channel></rss>