Towards Data Science – Medium
Your home for data science. A Medium publication sharing concepts, ideas and codes.

  • Radical Simplicity in Data Engineering
    by Cai Parry-Jones on July 26, 2024 at 2:09 pm

    Learn from Software Engineers and Discover the Joy of 'Worse is Better' Thinking

    (source: unsplash.com)

    Recently, I have had the fortune of speaking to a number of data engineers and data architects about the problems they face with data in their businesses. The main pain points I heard time and time again were:

    • Not knowing why something broke
    • Getting burnt with high cloud compute costs
    • Taking too long to build data solutions/complete data projects
    • Needing expertise on many tools and technologies

    These problems aren't new. I've experienced them, and you've probably experienced them too. Yet we can't seem to find a solution that solves all of these issues in the long run. You might think to yourself, 'well, point one can be solved with {insert data observability tool}', or 'point two just needs a stricter data governance plan in place'. The problem with this style of solution is that it adds additional layers of complexity, which makes the final two pain points more serious. The aggregate sum of pain remains the same, just distributed differently across the four points.

    (chart created by the author using Google Sheets)

    This article aims to present a contrary style of problem solving: radical simplicity.

    TL;DR

    • Software engineers have found massive success in embracing simplicity.
    • Over-engineering and pursuing perfection can result in bloated, slow-to-develop data systems, with sky-high costs to the business.
    • Data teams should consider sacrificing some functionality for the sake of simplicity and speed.

    A Lesson From Those Software Guys

    In 1989, the computer scientist Richard P. Gabriel wrote a relatively famous essay on computer systems, paradoxically called 'Worse Is Better'. I won't go into the details (you can read the essay here if you like), but the underlying message was that software quality does not necessarily improve as functionality increases. In other words, on occasion you can sacrifice completeness for simplicity and end up with an inherently 'better' product because of it.

    This was a strange idea to the pioneers of computing during the 1950s and 60s. The philosophy of the day was: a computer system needs to be pure, and it can only be pure if it accounts for all possible scenarios. This was likely because most leading computer scientists at the time were academics, who very much wanted to treat computer science as a hard science.

    Academics at MIT, the leading institution in computing at the time, started working on the operating system for the next generation of computers, called Multics. After nearly a decade of development and millions of dollars of investment, the MIT guys released their new system. It was unquestionably the most advanced operating system of the time; however, it was a pain to install due to its computing requirements, and feature updates were slow due to the size of the code base. As a result, it never caught on beyond a few select universities and industries.

    While Multics was being built, a small group supporting its development became frustrated with the system's growing requirements. They eventually decided to break away from the project. Armed with this experience, they set their sights on creating their own operating system, one with a fundamental philosophy shift:

    The design must be simple, both in implementation and interface. It is more important for the implementation to be simple than the interface. Simplicity is the most important consideration in a design.
    — Richard P. Gabriel

    Five years after Multics's release, the breakaway group released their operating system, Unix. Slowly but steadily it gained traction, and by the 1990s Unix had become the go-to choice for computers, with over 90% of the world's top 500 fastest supercomputers using it. To this day, Unix is still widely used, most notably as the system underlying macOS.

    There were obviously other factors beyond its simplicity that led to Unix's success. But its lightweight design was, and still is, a highly valuable asset of the system. That could only come about because the designers were willing to sacrifice functionality. The data industry should not be afraid to think the same way.

    Back to Data in the 21st Century

    Thinking back on my own experiences, the philosophy of most big data engineering projects I've worked on was similar to that of Multics. For example, there was a project where we needed to automate standardising the raw data coming in from all our clients. The decision was made to do this in the data warehouse via dbt, since we could then have a full view of data lineage from the very raw files right through to the standardised single-table version and beyond. The problem was that the first stage of transformation was very manual: it required loading each individual raw client file into the warehouse, and dbt then creates a model for cleaning each client's file. This led to hundreds of dbt models needing to be generated, all using essentially the same logic. dbt became so bloated that it took minutes for the data lineage chart to load in the dbt docs website, and our GitHub Actions for CI (continuous integration) took over an hour to complete for each pull request.

    This could have been resolved fairly simply if leadership had allowed us to make the first layer of transformations outside of the data warehouse, using AWS Lambda and Python. But no, that would have meant the data lineage produced by dbt wouldn't be 100% complete. That was it. That was the whole reason not to massively simplify the project. Similar to the group who broke away from the Multics project, I left this project mid-build; it was simply too frustrating to work on something that so clearly could have been much simpler. As I write this, I have discovered they are still working on the project.

    So, What the Heck is Radical Simplicity?

    Radical simplicity in data engineering isn't a framework or data-stack toolkit; it is simply a frame of mind: a philosophy that prioritises simple, straightforward solutions over complex, all-encompassing systems.

    Key principles of this philosophy include:

    • Minimalism: focusing on the core functionalities that deliver the most value, rather than trying to accommodate every possible scenario or requirement.
    • Accepting trade-offs: willingly sacrificing some degree of completeness or perfection in favour of simplicity, speed, and ease of maintenance.
    • Pragmatism over idealism: prioritising practical, workable solutions that solve real business problems efficiently, rather than pursuing theoretically perfect but overly complex systems.
    • Reduced cognitive load: designing systems and processes that are easier to understand, implement, and maintain, thus reducing the expertise required across multiple tools and technologies.
    • Cost-effectiveness: embracing simpler solutions that often require fewer computational resources and less human capital, leading to lower overall costs.
    • Agility and adaptability: creating systems that are easier to modify and evolve as business needs change, rather than rigid, over-engineered solutions.
    • Focus on outcomes: emphasising the end results and business value rather than getting caught up in the intricacies of the data processes themselves.

    This mindset can be in direct contradiction to the modern data engineering habit of adding more tools, processes, and layers. As a result, expect to have to fight your corner. Before suggesting an alternative, simpler solution, come prepared with a deep understanding of the problem at hand. I am reminded of the quote:

    It takes a lot of hard work to make something simple, to truly understand the underlying challenges and come up with elegant solutions. It's not just minimalism or the absence of clutter. It involves digging through the depth of complexity. To be truly simple, you have to go really deep. You have to deeply understand the essence of a product in order to be able to get rid of the parts that are not essential.
    — Steve Jobs

    Side note: be aware that adopting radical simplicity doesn't mean ignoring new tools and advanced technologies. In fact, one of my favourite solutions for a data warehouse at the moment is a relatively new open-source database called DuckDB. Check it out, it's pretty cool.

    Conclusion

    The lessons from software engineering history offer valuable insights for today's data landscape. By embracing radical simplicity, data teams can address many of the pain points plaguing modern data solutions.

    Don't be afraid to champion radical simplicity in your data team. Be the catalyst for change if you see opportunities to streamline and simplify. The path to simplicity isn't easy, but the potential rewards can be substantial.

    Radical Simplicity in Data Engineering was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

  • What We Still Don’t Understand About Machine Learning
    by Hesam Sheikh on July 26, 2024 at 2:07 pm

    Machine Learning unknowns that researchers struggle to understand — from Batch Norm to what SGD hides. Continue reading on Towards Data Science »

  • Python Concurrency — A Brain-Friendly Guide for Data Professionals
    by Dario Radečić on July 26, 2024 at 1:55 pm

    Moving data around can be slow. Here's how you can squeeze every bit of performance optimization out of Python. Continue reading on Towards Data Science »

  • Visualizing Road Networks
    by Milan Janosov on July 26, 2024 at 1:53 pm

    How to use Python and OSMnx to create beautiful visuals of global cities' road networks. Continue reading on Towards Data Science »

  • Data Modeling Techniques For Data Warehouse
    by Mariusz Kujawski on July 26, 2024 at 1:48 pm

    (Photo by Zdeněk Macháček on Unsplash)

    Data modeling is the process of creating a conceptual representation of the data and its relationships within an organization or system. Dimensional modeling is an advanced technique that attempts to present data in a way that is intuitive and understandable for any user. It also allows for high-performance access, flexibility, and scalability to accommodate changes in business needs.

    In this article, I will provide an in-depth overview of data modeling, with a specific focus on Kimball's methodology. Additionally, I will introduce other techniques used to present data in a user-friendly and intuitive manner. One particularly interesting technique for modern data warehouses is storing data in one wide table, although this approach may not be suitable for all query engines. I will present techniques that can be used in Data Warehouses, Data Lakes, Data Lakehouses, etc. However, it is important to choose the appropriate methodology for your specific use case and query engine.

    What is dimensional modeling?

    Every dimensional model consists of one or more tables with a multipart key, referred to as the fact table, along with a set of tables known as dimension tables. Each dimension table has a primary key that corresponds precisely to one of the components of the multipart key in the fact table. This distinct structure is commonly referred to as a star schema. In some cases, a more intricate structure called a snowflake schema can be used, where dimension tables are connected to smaller dimension tables.

    Benefits of dimensional modeling

    Dimensional modeling provides a practical and efficient approach to organizing and analyzing data, resulting in the following benefits:

    • Simplicity and understandability for business users.
    • Improved query performance for faster data retrieval.
    • Flexibility and scalability to adapt to changing business needs.
    • Data consistency and integration across multiple sources.
    • Enhanced user adoption and self-service analytics.

    Now that we have discussed what dimensional modeling is and the value it brings to organizations, let's explore how to effectively leverage it.

    Data and dimensional modeling methodologies

    While I intend to focus primarily on Kimball's methodology, let's briefly touch upon a few other popular techniques before diving into it.

    Inmon methodology

    Inmon suggests utilizing a normalized data model within the data warehouse. This methodology supports the creation of data marts: smaller, specialized subsets of the data warehouse that cater to specific business areas or user groups. These are designed to provide a more tailored and efficient data access experience for particular business functions or departments.

    Data Vault

    Data Vault is a modeling methodology that focuses on scalability, flexibility, and traceability. It consists of three core components: the Hub, the Link, and the Satellite.

    Hubs: Hubs are collections of all distinct business entities. For example, an account hub would include the account business key, an account_ID, a load_date, and a src_name. This allows us to track where each record originally came from and when it was loaded; the account_ID can be a surrogate key generated from the business key if we need one.

    Links: Links establish relationships between hubs and capture the associations between different entities. They contain the foreign keys of the related hubs, enabling the creation of many-to-many relationships.

    Satellites: Satellites store the descriptive information about the hubs, providing additional context and attributes. They include historical data, audit information, and other relevant attributes associated with a specific point in time.
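    To make the three components a little more concrete, here is a minimal generic-SQL sketch of a hub, a link, and a satellite for the account example. The table and column names (hub_account, link_account_customer, sat_account_details, and the referenced hub_customer) are hypothetical choices for illustration, not a prescribed Data Vault standard.

        -- Hub: one row per distinct business key, plus load metadata
        CREATE TABLE hub_account (
            account_id   CHAR(32)    NOT NULL,  -- surrogate key, e.g. a hash of the business key
            account_bk   VARCHAR(50) NOT NULL,  -- business key from the source system
            load_date    TIMESTAMP   NOT NULL,
            src_name     VARCHAR(50) NOT NULL,
            PRIMARY KEY (account_id)
        );

        -- Link: captures the relationship between two hubs (many-to-many capable)
        CREATE TABLE link_account_customer (
            link_id      CHAR(32)    NOT NULL,
            account_id   CHAR(32)    NOT NULL REFERENCES hub_account (account_id),
            customer_id  CHAR(32)    NOT NULL,  -- key of a related hub_customer table (not shown)
            load_date    TIMESTAMP   NOT NULL,
            src_name     VARCHAR(50) NOT NULL,
            PRIMARY KEY (link_id)
        );

        -- Satellite: descriptive attributes for the hub, versioned by load_date
        CREATE TABLE sat_account_details (
            account_id   CHAR(32)    NOT NULL REFERENCES hub_account (account_id),
            load_date    TIMESTAMP   NOT NULL,
            account_name VARCHAR(100),
            account_type VARCHAR(50),
            src_name     VARCHAR(50) NOT NULL,
            PRIMARY KEY (account_id, load_date)
        );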
    Data Vault's design allows for a flexible and scalable data warehouse architecture. It promotes data traceability, auditability, and historical tracking. This makes it suitable for scenarios where data integration and agility are critical, such as in highly regulated industries or rapidly changing business environments.

    One big table (OBT)

    OBT stores data in one wide table. Using one big table, or a denormalized table, can simplify queries, improve performance, and streamline data analysis. It eliminates the need for complex joins, eases data integration, and can be beneficial in certain scenarios. However, it may lead to redundancy, data integrity challenges, and increased maintenance complexity. Consider the specific requirements before opting for a single large table.

        WITH transactions AS (
          SELECT
            1000001 AS order_id,
            TIMESTAMP('2017-12-18 15:02:00') AS order_time,
            STRUCT(65401 AS id, 'John Doe' AS name, 'Norway' AS location) AS customer,
            [
              STRUCT('xxx123456' AS sku, 3 AS quantity, 1.3 AS price),
              STRUCT('xxx535522' AS sku, 6 AS quantity, 500.4 AS price),
              STRUCT('xxx762222' AS sku, 4 AS quantity, 123.6 AS price)
            ] AS orders
          UNION ALL
          SELECT
            1000002,
            TIMESTAMP('2017-12-16 11:34:00'),
            STRUCT(74682, 'Jane Smith', 'Poland') AS customer,
            [
              STRUCT('xxx635354', 4, 345.7),
              STRUCT('xxx828822', 2, 9.5)
            ] AS orders
        )
        SELECT *
        FROM transactions

    In the case of one wide table, we don't need to join tables: we can aggregate and analyze the data using only this one table. This method improves performance in BigQuery.

        SELECT
          customer.name,
          SUM(a.quantity)
        FROM transactions t, UNNEST(t.orders) AS a
        GROUP BY customer.name

    Kimball methodology

    The Kimball methodology places significant emphasis on the creation of a centralized data repository known as the data warehouse. This data warehouse serves as a single source of truth, integrating and storing data from various operational systems in a consistent and structured manner.

    This approach offers a comprehensive set of guidelines and best practices for designing, developing, and implementing data warehouse systems. It places a strong emphasis on creating dimensional data models and prioritizes simplicity, flexibility, and ease of use. Now, let's delve into the key principles and components of the Kimball methodology.

    Entity model to dimensional model

    In our data warehouses, the sources of data are often entity models that are normalized into multiple tables containing the business logic for applications. This can be challenging, as one needs to understand the dependencies between tables and the underlying business logic, and creating an analytical report or generating statistics often requires joining multiple tables.

    To create a dimensional model, the data needs to undergo an Extract, Transform, and Load (ETL) process to denormalize it into a star schema or snowflake schema. The key activity in this process involves identifying the fact and dimension tables and defining the granularity. The granularity determines the level of detail stored in the fact table; for example, transactions can be stored individually or aggregated per hour or per day.
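    To show how the chosen grain shows up in the load itself, here is a minimal sketch that rolls transaction-level rows up to a daily grain. The names stg_transactions and fact_daily_sales, and their columns, are hypothetical and used only for illustration; the right grain always depends on the questions the business needs to answer.

        -- Hypothetical daily-grain load: one fact row per day, store, and product.
        -- Loading at the raw transaction grain would instead keep one row per
        -- transaction line and skip this aggregation.
        INSERT INTO fact_daily_sales (date_key, store_key, product_key, quantity, sales_amount)
        SELECT
            CAST(t.order_time AS DATE)  AS date_key,
            t.store_key,
            t.product_key,
            SUM(t.quantity)             AS quantity,
            SUM(t.quantity * t.price)   AS sales_amount
        FROM stg_transactions AS t
        GROUP BY CAST(t.order_time AS DATE), t.store_key, t.product_key;

    Choosing the lower, transaction-level grain keeps more detail but produces a much larger fact table.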
    Let's assume we have a company that sells bikes and bike accessories. In this case, we have information about:

    • Transactions
    • Stores
    • Clients
    • Products

    Based on our business knowledge, we know that we need to collect information about sales volume and quantity over time, segmented by region, customer, and product. With this information, we can design our data model. The transactions table will serve as our fact table, and the stores, clients, and products tables will act as dimension tables.

    Fact table

    A fact table typically represents a business event or transaction and includes the metrics or measures associated with that event. These metrics can encompass various data points such as sales amounts, quantities sold, customer interactions, website clicks, or any other measurable data that offers insights into business performance. The fact table also includes foreign key columns that establish relationships with dimension tables. A best practice in fact table design is to put all the foreign keys at the top of the table, followed by the measures.

    Fact table types

    • Transaction fact tables have the lowest-level grain: one row represents a record from the transaction system. Data is refreshed on a daily basis or in real time.
    • Periodic snapshot fact tables capture a snapshot of a fact table at a point in time, for instance the end of the month.
    • Accumulating snapshot fact tables summarize the measurement events occurring at predictable steps between the beginning and the end of a process.
    • Factless fact tables keep information about events occurring without any measures or metrics.

    Dimension table

    A dimension table is a type of table in dimensional modeling that contains descriptive attributes, for instance information about a product, its category, and its type. Dimension tables provide the context and perspective for the quantitative data stored in the fact table.

    Dimension tables contain a unique key that identifies each record in the table, called the surrogate key. The table can also contain a business key, which is the key from the source system. A good practice is to generate a surrogate key instead of using the business key. There are several approaches to creating a surrogate key:

    • Hashing: the surrogate key is generated using a hash function such as MD5 or SHA256 (e.g. md5(concat(key_1, key_2, key_3))).
    • Incrementing: the surrogate key is generated using a number that is always incrementing (e.g. row_number(), identity).
    • Concatenating: the surrogate key is generated by concatenating the unique key columns (e.g. concat(key_1, key_2, key_3)).
    • Unique generated: the surrogate key is generated using a function that produces a unique identifier (e.g. GENERATE_UUID()).

    The method you choose depends on the engine you use to process and store data, and it can impact the performance of querying the data.

    Dimension tables often contain hierarchies:

    a) A parent-child hierarchy can be used, for example, to represent the relationship between an employee and their manager.

    b) Hierarchical relationships can also exist between attributes. For example, a time dimension might have attributes like year, quarter, month, and day, forming a hierarchical structure.
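    To tie the fact and dimension concepts together, here is a minimal sketch of what a star schema for the bike company could look like. The table and column names (fact_transactions, dim_store, dim_client, dim_product) are hypothetical and kept deliberately small.

        -- Dimension tables: surrogate key plus descriptive attributes
        CREATE TABLE dim_store (
            store_key    INT          NOT NULL PRIMARY KEY,  -- surrogate key
            store_bk     VARCHAR(20)  NOT NULL,              -- business key from the source system
            store_name   VARCHAR(100),
            region       VARCHAR(50)
        );

        CREATE TABLE dim_client (
            client_key   INT          NOT NULL PRIMARY KEY,
            client_bk    VARCHAR(20)  NOT NULL,
            client_name  VARCHAR(100),
            city         VARCHAR(50)
        );

        CREATE TABLE dim_product (
            product_key  INT          NOT NULL PRIMARY KEY,
            product_bk   VARCHAR(20)  NOT NULL,
            product_name VARCHAR(100),
            category     VARCHAR(50)
        );

        -- Fact table: foreign keys first, then the measures
        CREATE TABLE fact_transactions (
            date_key     INT NOT NULL,  -- e.g. 20240726, pointing at a date dimension
            store_key    INT NOT NULL REFERENCES dim_store (store_key),
            client_key   INT NOT NULL REFERENCES dim_client (client_key),
            product_key  INT NOT NULL REFERENCES dim_product (product_key),
            quantity     INT,
            sales_amount DECIMAL(12, 2)
        );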
    Types of dimension tables

    • Conformed dimension: a dimension that can be used by multiple fact tables. For example, a region table can be utilized by different fact tables.
    • Degenerate dimension: occurs when an attribute is stored in the fact table instead of a dimension table. For instance, the transaction number can be kept in the fact table.
    • Junk dimension: contains non-meaningful attributes that do not fit well in existing dimension tables, or combinations of flags and binary values representing various combinations of states.
    • Role-playing dimension: the same dimension is referenced by more than one foreign key in the fact table. For example, a date dimension can refer to different dates in a fact table, such as creation date, order date, and delivery date.
    • Static dimension: a dimension that typically never changes. It can be loaded from reference data without requiring updates. An example could be a list of branches in a company.
    • Bridge table: used when there are many-to-many relationships between a fact table and a dimension table.

    Slowly changing dimension

    A Slowly Changing Dimension (SCD) is a concept in dimensional modeling that handles changes to dimension attributes over time. SCD provides a mechanism for maintaining historical and current data within a dimension table as business entities evolve and their attributes change. There are several types of SCD, but the three most popular ones are:

    • SCD Type 0: only new records are imported into the dimension table; existing records are never updated.
    • SCD Type 1: new records are imported and existing records are updated in place, so no history is kept.
    • SCD Type 2: new records are imported, and when attributes change a new row with the new values is created while the previous row is kept as history.

    For example, when John Smith moves from London to another city, we use SCD Type 2 to keep the information about his transactions related to London: we create a new record for the new city and mark the previous record as no longer current. As a result, historical reports will retain the information that his earlier purchases were made in London.

        MERGE INTO client AS tgt
        USING (
            SELECT
                Client_id,
                Name,
                Surname,
                City,
                GETDATE()    AS ValidFrom,
                '2199-01-01' AS ValidTo
            FROM client_stg
        ) AS src
        ON (tgt.Client_id = src.Client_id AND tgt.iscurrent = 1)
        WHEN MATCHED THEN
            -- a current row already exists: close it so it is no longer current
            UPDATE SET iscurrent = 0, ValidTo = GETDATE()
        WHEN NOT MATCHED THEN
            -- no current row exists: insert the incoming record as the current version
            INSERT (Client_id, Name, Surname, City, ValidFrom, ValidTo, iscurrent)
            VALUES (src.Client_id, src.Name, src.Surname, src.City, src.ValidFrom, src.ValidTo, 1);

    SCD Type 3, by contrast, keeps the new value and the previous value in separate columns of the same row.
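    As a minimal, hypothetical sketch of SCD Type 3 in the same T-SQL style, assume a client table with one row per client and an extra previous_city column (a column introduced only for this example):

        -- SCD Type 3 sketch: keep only the current and the previous value, in separate columns.
        UPDATE tgt
        SET previous_city = tgt.City,   -- shift the old value into the "previous" column
            City          = src.City    -- overwrite with the new value
        FROM client AS tgt
        INNER JOIN client_stg AS src
            ON src.Client_id = tgt.Client_id
        WHERE src.City <> tgt.City;

    Unlike Type 2, no new row is created, so only one level of history is retained.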
    Star schema vs. snowflake schema

    The most popular approach to designing a data warehouse is to utilize either a star schema or a snowflake schema. In a star schema, there are fact tables and dimension tables that are directly related to the fact table. A snowflake schema, on the other hand, consists of a fact table, dimension tables related to the fact table, and additional dimensions related to those dimension tables.

    The main difference between these two designs lies in their approach to normalization. The star schema keeps data denormalized, while the snowflake schema ensures normalization. The star schema is designed for better query performance, whereas the snowflake schema is specifically tailored to handle updates on large dimensions. If you encounter challenges with updates to extensive dimension tables, consider transitioning to a snowflake schema.

    Data loading strategies

    In our data warehouse, data lake, or data lakehouse we can have various load strategies, such as:

    • Full load: the full load strategy involves loading all data from the source systems into the data warehouse. This strategy is typically used in the case of performance issues or when the source lacks columns that could indicate row modifications.
    • Incremental load: the incremental load strategy involves loading only data that is new since the last load. If rows in the source system can't be changed, we can load only new records based on a unique identifier or creation date. We need to define a "watermark" that we will use to select new rows.
    • Delta load: the delta load strategy focuses on loading only the changed and new records since the last load. It differs from incremental load in that it specifically targets the delta changes rather than all records. Delta load strategies can be efficient when dealing with high volumes of data changes and can significantly reduce the processing time and resources required.

    The most common strategy is to populate dimension tables first and then fact tables. The order is important because we need to use the primary keys from the dimension tables in the fact tables to create relationships between the tables. There is an exception: when we need to load a fact table before a dimension table, the technique is called late arriving dimensions. In this case, we can create surrogate keys in the dimension table and update it via the ETL process after populating the fact table.

    Summary

    After a thorough reading of the article, if you have any questions or would like to further discuss data modeling and effective dimensional models, feel free to reach out to me on LinkedIn. Implementing data modeling can unlock the potential of your data, providing valuable insights for informed decision-making while you gain knowledge of methods and best practices.

    Data Modeling Techniques For Data Warehouse was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

 
