We’ve all been there, tasked with transforming an utterly useless file into a data set that can be used and analyzed. My most infuriating example was a data set provided as two 1,500-page PDFs. Yes, you read that correctly: two 1,500-page PDFs! What is anyone supposed to do with a PDF?!? It gets worse than that, though. The necessary data was formatted into three columns. Data fields would break randomly across lines. Some, but not all, contained characters like ¨ or * between fields. There was no uniformity to the fields present for each entry or the order they appeared in. That was in my pre-Power BI days, PPBI if you will. I had to copy the PDF file and paste it as Unicode text into a Word doc, where I removed the columns and irritating characters, then into Excel, where the transformation took place. It took several days, and gallons of coffee, but the necessary information was eventually extracted into a beautiful, usable database. What a glorious thing to behold!
Unfortunately, dirty data isn’t a thing of the past, and it’s not going to disappear any time soon. Anyone who doesn’t work in data analytics probably has no idea of the importance of data formatting, or, if they do, they don’t fully understand how to format data properly. One large data set that I work with regularly is populated by a department that doesn’t do any analytics on the data they generate. That department is wholly invested in the status quo, as any changes would mean adopting new processes and could result in more work for them. Suggesting changes to processes and workflow can also be met with resistance from leadership, as it raises the possibility that the current processes, adopted and sanctioned by those individuals, aren’t working for everyone. In a perfect business, all parts of the business would work together to ensure the most efficient workflow for the whole, everyone would understand the importance of data and work to ensure the integrity of data sets, and change would be viewed as a positive. But no business is perfect. Analysts have to work with the data available and find a way to use dirty data to generate useful insights that assist in making high-level decisions about a business.
Fortunately, we now have better tools at our disposal to transform less-than-ideal (aka obnoxious) data sets into usable data. Power BI is leading the way with integrated features that make cleaning and normalizing your data a much less infuriating task than it once was. Instead of a grueling Spartan marathon, data transformation can now be a fun Sudoku puzzle. Power BI is making data transformation a much more palatable task.
Power Query and the Query Editor
Power BI’s Query Editor is such a versatile tool that I could probably write several articles detailing various aspects of its use, but for the purpose of this discussion I’m going to outline some of the ways that the Query Editor can be used to transform and normalize data. Aside from its built-in transformation tools, the Query Editor lets you run R scripts and native SQL queries, all on top of Microsoft’s M language, which records every query step; DAX and MDX round out Power BI’s language toolkit elsewhere in the product.
Data transformation begins during the import stage. Power Query and the Query Editor contain features that allow you to join files, change data types and formats, and remove columns during import, and the program will even detect some changes and apply them automatically. The ability to join multiple files that contain similar data (e.g., sales reports spanning several months) during import is an incredible efficiency tool. All of the transformation that occurs at that stage can be performed on multiple files at once, saving you from repeating the process for each separate file and joining the results in a later step.
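To give a flavor of what that file-combining step looks like under the hood, here is a rough sketch of the kind of M query the Query Editor generates when you point it at a folder of files. The folder path, file format, and column layout are hypothetical; your generated query will differ in the details:

```
let
    // Point at a folder of monthly sales CSVs (path is made up for illustration)
    Source = Folder.Files("C:\Reports\Sales"),
    // Keep only the CSV files in the folder
    CsvFiles = Table.SelectRows(Source, each [Extension] = ".csv"),
    // Parse each file's binary content into a table, promoting headers
    Parsed = Table.AddColumn(CsvFiles, "Data",
        each Table.PromoteHeaders(Csv.Document([Content]))),
    // Stack all of the per-file tables into one combined table
    Combined = Table.Combine(Parsed[Data])
in
    Combined
```

Because every file flows through the same steps, any transformation you add after this point applies to all of the monthly reports at once.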
Once you’ve imported your data, the Query Editor provides numerous options for data transformation from the built-in menus. Using only the menu tools, you can promote your first row to headers, remove or add columns, split columns (by delimiter or by number of characters), pivot and unpivot data, and group and dedupe your data. Microsoft has a great overview of the Query Editor functionality on the Power BI Blog.
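Each of those menu clicks is recorded as a step in M behind the scenes, which you can inspect later. As a hedged sketch (the file path and column names here are invented for illustration), a handful of common menu operations generate steps along these lines:

```
let
    // Load the first sheet of a workbook (hypothetical path)
    Source = Excel.Workbook(File.Contents("C:\Data\report.xlsx")){0}[Data],
    // "Use First Row as Headers"
    Promoted = Table.PromoteHeaders(Source),
    // "Split Column > By Delimiter" on a hypothetical "Name" column
    Split = Table.SplitColumn(Promoted, "Name",
        Splitter.SplitTextByDelimiter(", "), {"Last", "First"}),
    // "Unpivot Other Columns" turns month columns into attribute/value pairs
    Unpivoted = Table.UnpivotOtherColumns(Split, {"Last", "First"}, "Month", "Sales"),
    // "Remove Duplicates"
    Deduped = Table.Distinct(Unpivoted)
in
    Deduped
```

The point is that the menus and the language are the same pipeline: anything you click can also be read, tweaked, or rearranged as a list of named steps.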
Die-hard Excel users will already be familiar with the basic functionality available through Power Query in Excel; it’s virtually identical in Power BI, making the transition from Excel to Power BI seamless. For a great overview tutorial on using Power Query and the Query Editor, check out this video (the demonstration is shown in Excel, but again, the experience is very similar in Power BI).
Give me an M
If you want to go beyond the functionality of the built-in tools, you’ll need to learn M, or the language formerly known as the Power Query Formula Language. Most users will only use M indirectly via the Query Editor, but the code behind every query can be viewed and edited via the Advanced Editor. M is a mashup query language. According to the MSDN Power Query M Reference page, “The Power Query M formula language is optimized for building highly flexible data mashup queries.” Learning M is worth the effort for any Power BI user, as it greatly expands one’s data wrangling and transformation capabilities. But don’t take my word for it. At the June 2017 Microsoft Data Insights Summit in Seattle, Matt Masson explained how to use M to create a custom data connector, and Paul Turley and Brian Grant discussed advanced data transformation using M. If they can’t sell you on the benefits of learning M, nothing I can say will sway you.
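To give a flavor of the language before you dive in: every M query is a single expression, most often a let expression that names intermediate steps and returns one of them after in. A minimal, hypothetical example (the file path, column name, and filter threshold are all assumptions, not anything from a real data set):

```
let
    // Read a CSV file into a table (hypothetical path)
    Source = Csv.Document(File.Contents("C:\Data\orders.csv")),
    // Promote the first row to column headers
    Promoted = Table.PromoteHeaders(Source),
    // Convert a hypothetical "Amount" column from text to number
    Typed = Table.TransformColumnTypes(Promoted, {{"Amount", type number}}),
    // Keep only rows where Amount exceeds 100
    Filtered = Table.SelectRows(Typed, each [Amount] > 100)
in
    Filtered
```

Note that M is case-sensitive and functional: each step is just a named value built from the previous one, which is why the Query Editor can display your query as an ordered list of applied steps.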
For a thorough introduction to M, I recommend another lecture by Matt Masson, Deep Dive into Power Query Formula Language. (He references versions of this talk in the previous video.) Once you’ve explored the basics, the Power Query M Reference page offers a comprehensive overview of the language, including a function reference and a downloadable PDF of the Power Query Language Specification.