
[{"content":"","date":"15 June 2023","externalUrl":null,"permalink":"/tags/aws/","section":"Tags","summary":"","title":"Aws","type":"tags"},{"content":"","date":"15 June 2023","externalUrl":null,"permalink":"/series/bird-hotspot-finder/","section":"Series","summary":"","title":"Bird Hotspot Finder","type":"series"},{"content":"","date":"15 June 2023","externalUrl":null,"permalink":"/tags/data-engineering/","section":"Tags","summary":"","title":"Data-Engineering","type":"tags"},{"content":"","date":"15 June 2023","externalUrl":null,"permalink":"/tags/dbeaver/","section":"Tags","summary":"","title":"Dbeaver","type":"tags"},{"content":"","date":"15 June 2023","externalUrl":null,"permalink":"/tags/dlt/","section":"Tags","summary":"","title":"Dlt","type":"tags"},{"content":" Tip Check out this repository to see the end product.\nBackground # One of the cooler things about dlt is the schema inference and its inherent statefulness. As data engineers, we are typically subject to our data sources changing shape at any given point in time. Let\u0026rsquo;s move beyond our simple pipeline from the previous post and add three new features:\nSend the dlt loaded package data to AWS S3. Store original schema and then any future changes in our destination database. Send Slack notifications via incoming webhooks on changed table schemas (comes from package information). Send load package data to AWS S3 bucket # Create S3 bucket and IAM role with permissions # These steps are fairly straightforward if you\u0026rsquo;re familiar with AWS, but if you\u0026rsquo;re not, here is a video overview. As you watch, know that your goals are essentially:\nCreate an AWS account (or log in to your existing one). Make sure that you have a user. Make sure that you have a role with the right permissions (AWS calls it IAM, or Identity Access Management). Click into the S3 service and create a new bucket to hold your files. Update secrets.toml and pipeline.py # The way that we\u0026rsquo;re going to accomplish sending our pipeline data to S3 is by creating a staging (read: intermediate) layer between our source and destination. Within the scope of our example, we\u0026rsquo;re going to send eBird API data (source) to a destination compatible with remote staging layers.\nNote: because I am a poor simple caveman data engineer, I do not have a personal cloud warehouse at my disposal. Even though the DuckDB destination is not enabled with having a dlt staging layer, we will continue our example like it is.\nTo update the secrets.toml file, add the following:\n[destination.filesystem] bucket_url = \u0026#34;s3://[your_bucket_name]\u0026#34; # replace with your bucket name [destination.filesystem.credentials] aws_access_key_id = \u0026#34;please set me up!\u0026#34; # copy the access key here aws_secret_access_key = \u0026#34;please set me up!\u0026#34; # copy the secret access key here We\u0026rsquo;ll simply need to add the following to our dlt pipeline object:\nif __name__ == \u0026#34;__main__\u0026#34;: pipeline = dlt.pipeline( pipeline_name=\u0026#39;ebirdapi\u0026#39;, destination=\u0026#39;duckdb\u0026#39;, staging=\u0026#39;filesystem\u0026#39;, # This line is what you\u0026#39;ll need to add dataset_name=\u0026#39;ebirdapi_data\u0026#39;, full_refresh=True, ) Conveniently, dlt will choose the file format for us — but you can always manually specify, like so:\nload_info = pipeline.run(ebirdapi_source(), loader_file_format=\u0026#34;parquet\u0026#34;) Caution At this step, dlt will not dump the current schema content to the bucket. Please continue for that.\nSend load package schemas to destination database # Originally, this section was going to be titled: \u0026ldquo;Send schemas to AWS S3.\u0026rdquo; And it\u0026rsquo;s great if you\u0026rsquo;d like to do that. Simply ask ChatGPT to \u0026ldquo;write a Python script to send a file in a specific directory to an AWS S3 bucket\u0026rdquo; and you\u0026rsquo;ll be good to go. HOWEVER, because dlt is stateful, you might as well add two lines to your pipeline script and load the schema changes as tables directly to your destination warehouse.\nAdding these two lines for an extra run of your pipeline object will initially generate the schema of your table and columns, and subsequently will generate schema changes:\ntable_updates = [p.asdict()[\u0026#34;tables\u0026#34;] for p in load_info.load_packages] pipeline.run(table_updates, table_name=\u0026#34;_new_tables\u0026#34;) Send Slack notification on schema change # Use the following Slack walkthrough to:\nCreate a Slack app. Enable incoming webhooks for the new app. Create an incoming webhook (get a URL to pass to your secrets.toml for the new app). Add the following code to your script:\nfrom dlt.common.runtime.slack import send_slack_message send_slack_message(pipeline.runtime_config.slack_incoming_hook, message) ","date":"15 June 2023","externalUrl":null,"permalink":"/posts/bird-finder-via-dlt-ii/","section":"Posts","summary":"Extending the dlt pipeline: stage load packages to S3, persist schema changes back to DuckDB, and ping Slack on schema drift.","title":"dlt Demo: Bird Hotspot Finder, pt. II","type":"posts"},{"content":"","date":"15 June 2023","externalUrl":null,"permalink":"/tags/duckdb/","section":"Tags","summary":"","title":"Duckdb","type":"tags"},{"content":"","date":"15 June 2023","externalUrl":null,"permalink":"/tags/ebird/","section":"Tags","summary":"","title":"Ebird","type":"tags"},{"content":"","date":"15 June 2023","externalUrl":null,"permalink":"/","section":"Lough on Data","summary":"","title":"Lough on Data","type":"page"},{"content":"","date":"15 June 2023","externalUrl":null,"permalink":"/posts/","section":"Posts","summary":"","title":"Posts","type":"posts"},{"content":"","date":"15 June 2023","externalUrl":null,"permalink":"/tags/python/","section":"Tags","summary":"","title":"Python","type":"tags"},{"content":"","date":"15 June 2023","externalUrl":null,"permalink":"/tags/s3/","section":"Tags","summary":"","title":"S3","type":"tags"},{"content":"","date":"15 June 2023","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"},{"content":"","date":"15 June 2023","externalUrl":null,"permalink":"/tags/slack/","section":"Tags","summary":"","title":"Slack","type":"tags"},{"content":"","date":"15 June 2023","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":" Tip Check out this repository to see the end product.\nBackground # There are a couple of tools that I\u0026rsquo;ve been meaning to demo but haven\u0026rsquo;t found the time… no longer! Today we\u0026rsquo;re going to demo the dlt (data load tool) package as an alternative to Meltano or Airbyte. We\u0026rsquo;ll also load that data to a duckdb and then use DBeaver as an IDE for querying. dlt is very cool in that it takes advantage of native data objects in Python for loading, has schema inference, schema evolution (exportable files), and can alert folks when things change. One thing I was not a fan of when using Meltano was that so many of the taps/sources were written to behave like the following:\nuse credentials to create a connection load the entire dataset bring to staging to be ingested to your favorite cloud warehouse Which works great, until it doesn\u0026rsquo;t. There are so many times when a single table is enormous (anything over 100GB), and loading in the whole table at once is just not an option. Enter dlt, which:\n…For example, consider a scenario where you need to extract data from a massive database with millions of records. Instead of loading the entire dataset at once, dlt allows you to use iterators to fetch data in smaller, more manageable portions. This technique enables incremental processing and loading, which is particularly useful when dealing with limited memory resources.\nWithout further ado:\nEnvironment set up # Make sure that you have poetry installed from the PyPi package manager. Once that\u0026rsquo;s done, do the following:\nmake a new directory with whatever name you like mkdir birds_are_cool \u0026amp;\u0026amp; cd birds_are_cool In this new directory, create a file called pyproject.toml, and then copy + paste the following into that file: [tool.poetry] name = \u0026#34;bird-finder-2.0\u0026#34; version = \u0026#34;0.1.0\u0026#34; description = \u0026#34;birding hotspots relative to current location\u0026#34; authors = [\u0026#34;Your Name \u0026lt;you@example.com\u0026gt;\u0026#34;] [tool.poetry.dependencies] python = \u0026#34;^3.10\u0026#34; dlt = \u0026#34;^0.3.12\u0026#34; duckdb = \u0026#34;^0.8.0\u0026#34; python-dotenv = \u0026#34;^0.20.0\u0026#34; click = \u0026#34;^8.1.1\u0026#34; colorama = \u0026#34;^0.4.4\u0026#34; [tool.poetry.dev-dependencies] [build-system] requires = [\u0026#34;poetry-core\u0026gt;=1.2.0\u0026#34;] build-backend = \u0026#34;poetry.core.masonry.api\u0026#34; now we\u0026rsquo;re going to create the environment that we\u0026rsquo;ve described via the pyproject.toml poetry install (which will create a snapshot of the environment via the new file poetry.lock) and finally, we\u0026rsquo;re going to activate this new environment we\u0026rsquo;ve made poetry shell So, now we\u0026rsquo;ve got everything set up for dlt and duckdb — last but not least, install the community edition (read \u0026ldquo;free\u0026rdquo; edition) of the IDE called DBeaver.\nInitialize a new project # We have our environment set up, now it\u0026rsquo;s time to get loadin\u0026rsquo;. For our demo purposes, we\u0026rsquo;re going to use the eBird API because it\u0026rsquo;s very well documented, and load data from the exposed API endpoints into our duckdb. dlt makes the duckdb for us when we eventually run our pipeline, we don\u0026rsquo;t need to do ANYTHING. Within our birds_are_cool directory, run the following:\nScaffold a new project structure… duckdb is already a dlt-supported destination (along with Snowflake and others), but the ebird_api is not a verified source so we\u0026rsquo;ll be modifying it ourselves (don\u0026rsquo;t be scared, it\u0026rsquo;s ONE file). dlt init ebird_api duckdb You should now have the following file structure in your directory: .dlt folder .gitignore file ebirdapi.py poetry.lock pyproject.toml README.md Set up source connection # Get our API key from eBird # Create an account. If you don\u0026rsquo;t already have an account on the eBird website, you\u0026rsquo;ll need to create one. Go to eBird and sign up. Log in. After creating an account, log in. Navigate to the API page. Visit https://ebird.org/api/keygen. Request an API key. Fill out the form with the required information — name, email, project description, intended use. Agree to terms. Review and agree to the eBird API terms of use. Submit request. Get API key. You should receive an email containing your key. Start using the API. With your key in hand, you can authenticate requests. Important Place your API key inside of the birds_are_cool/.dlt/secrets.toml file.\nEdit the ebirdapi.py file # When you first get this file, there will be some helpful comments about how to structure the file, but ultimately it will look like this for our demo:\nStandard imports:\nimport dlt from dlt.sources.helpers import requests import requests as req # We want the exceptions from the package If you\u0026rsquo;ve never worked with an API before, this function might seem strange. We\u0026rsquo;re creating the header for an application programming interface. We are eventually going to ask the API for data, and this header will contain some metadata to authenticate us (headers are also used for other things as well).\ndef _create_auth_headers(api_secret_key): \u0026#34;\u0026#34;\u0026#34;Constructs Bearer type authorization header which is the most common authorization method.\u0026#34;\u0026#34;\u0026#34; headers = {\u0026#34;X-eBirdApiToken\u0026#34;: f\u0026#34;{api_secret_key}\u0026#34;} return headers Sources \u0026amp; resources:\nBoth are denoted by a decorator (fancy Python for \u0026ldquo;add this functionality to my subsequent function\u0026rdquo;). Sources are the high-level, logical grouping function for one or many resource functions. Resources are endpoints / tables / streams that are under the source in question. @dlt.source def ebirdapi_source(loc_code: str = \u0026#39;US-WA\u0026#39;, api_secret_key=dlt.secrets.value): return recent_observations(loc_code, api_secret_key) @dlt.resource(write_disposition=\u0026#34;replace\u0026#34;) def recent_observations(loc_code: str, api_secret_key=dlt.secrets.value): headers = _create_auth_headers(api_secret_key) ebird_api_url = f\u0026#39;https://api.ebird.org/v2/data/obs/{loc_code}/recent/notable?detail=full\u0026#39; try: response = req.get(ebird_api_url, headers=headers) response.raise_for_status() data = response.json() yield data except req.exceptions.RequestException as e: print(\u0026#34;Error fetching data from eBird API:\u0026#34;, e) yield [] Here we first create an instance of a dlt.pipeline, and then execute the class\u0026rsquo;s run method. On this rare occasion, the docs explain it best:\nThis method will extract the data from the data argument, infer the schema, normalize the data into a load package (i.e., jsonl or PARQUET files representing tables) and then load such packages into the destination.\nThe data may be supplied in several forms:\na list or Iterable of any JSON-serializable objects, e.g. dlt.run([1, 2, 3], table_name=\u0026quot;numbers\u0026quot;) any Iterator or a function that yields (Generator), e.g. dlt.run(range(1, 10), table_name=\u0026quot;range\u0026quot;) a function or a list of functions decorated with @dlt.resource, e.g. dlt.run([chess_players(title=\u0026quot;GM\u0026quot;), chess_games()]) a function or a list of functions decorated with @dlt.source. if __name__ == \u0026#34;__main__\u0026#34;: # Configure the pipeline with your destination details pipeline = dlt.pipeline( pipeline_name=\u0026#39;ebirdapi\u0026#39;, destination=\u0026#39;duckdb\u0026#39;, dataset_name=\u0026#39;ebirdapi_data\u0026#39; ) # Run the pipeline with your parameters load_info = pipeline.run(ebirdapi_source()) # Pretty print the information on data that was loaded print(load_info) Once you have this file in its entirety copied and pasted, you can change the location code in line 6 (to find birding hotspots near you) and then in the terminal with your poetry shell, run:\npython3 ebirdapi.py Data will be put into a new .duckdb instance and it\u0026rsquo;ll be ready to be checked out in DBeaver.\nQuerying data # Create a new connection and browse + select the new .duckdb you just made. Either view the database / schema / table layout in the lefthand pane, or go straight to a new worksheet and write whatever SQL you\u0026rsquo;d like. ","date":"1 June 2023","externalUrl":null,"permalink":"/posts/bird-finder-via-dlt-i/","section":"Posts","summary":"Loading eBird API data into DuckDB with dlt, then querying it in DBeaver. A first look at dlt as a Meltano/Airbyte alternative.","title":"dlt Demo: Bird Hotspot Finder, pt. I","type":"posts"},{"content":"I\u0026rsquo;m Connor. I\u0026rsquo;m a data engineer at Harness, based in Arizona. Day-to-day that means around fifty DLT pipelines, transformations in SQLMesh, deploying internal services with our security team, and building an AI app that sits on top of our warehouse and talks to Slack so the rest of the company can ask the data questions in plain English instead of pinging our centralized data team.\nI don\u0026rsquo;t write code because I like writing code. I write code because it\u0026rsquo;s the fastest way I\u0026rsquo;ve found to solve the problems I\u0026rsquo;ve found to be worth solving. That distinction matters to me. The good days are the ones where a pipeline makes someone\u0026rsquo;s job quieter or a query answers a question that used to require a meeting.\nOutside of work I hunt, which is how I ended up caring about geospatial data, public-land boundaries, and the strange politics of who owns what dirt. I keep an eye on what MotherDuck is shipping — they release features faster than I can find excuses to use them, which is a good problem.\nWhat I\u0026rsquo;m up for # Open to consulting engagements, especially anything sitting at the intersection of data engineering and geospatial. Curious about federal / GIS roles where the work is closer to land and resources than it is to ad-tech.\nWhere I\u0026rsquo;ve been # Conferences\ndbt Coalesce 2022 — New Orleans, LA dbt Coalesce 2023 — San Diego, CA WWDVC 2023 — Stowe, VT Data Saturday #52 — Oslo, Norway AI Council 2026 — San Francisco, CA Trainings\nData Vault 2.0 Practitioner — Helsinki, Finland Get in touch # Email — loughondata@protonmail.com GitHub — @Doctacon LinkedIn — crlough ","externalUrl":null,"permalink":"/about/","section":"Lough on Data","summary":"","title":"About","type":"page"},{"content":"","externalUrl":null,"permalink":"/authors/","section":"Authors","summary":"","title":"Authors","type":"authors"},{"content":"","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","externalUrl":null,"permalink":"/projects/","section":"Projects","summary":"","title":"Projects","type":"projects"}]