We’ve all been there, tasked with transforming an utterly useless file into a data set that can be used and analyzed. My most infuriating example was a data set provided as two 1,500-page PDFs. Yes, you read that correctly, two 1,500-page PDFs! What is anyone supposed to do with a PDF?!? It gets worse than that, though. The necessary data was formatted into three columns. Data fields would break randomly across lines. Some, but not all, contained characters like ¨ or * between fields. There was no uniformity to the fields present for each entry or the order they were in. That was in my pre-Power BI days, PPBI if you will. I had to copy the PDF file and paste it as Unicode text into a Word doc, where I removed the columns and irritating characters, then into Excel, where the transformation took place. It took several days, and gallons of coffee, but the necessary information was eventually extracted into a beautiful, usable database. What a glorious thing to behold!

Unfortunately, dirty data isn’t a thing of the past, and it’s not going to disappear any time soon. Anyone who doesn’t work in data analytics probably has no idea of the importance of data formatting, or if they do, they don’t fully understand how to properly format data. One large data set that I work with regularly is populated by a department that doesn’t do any analytics on the data it generates. That department is wholly invested in the status quo, as any changes would mean adopting new processes and could result in more work for them. Suggesting changes to processes and workflow can also be met with resistance from leadership, as it raises the possibility that the current processes, adopted and sanctioned by those individuals, aren’t working for everyone. In a perfect business, every department would work together to ensure the most efficient workflow for the whole, everyone would understand the importance of data and work to protect the integrity of data sets, and change would be viewed as a positive, but no business is perfect. Analysts have to work with the data available and find a way to use dirty data to generate useful insights that assist in making high-level decisions about a business.

Fortunately, we now have better tools at our disposal to transform less-than-ideal (aka obnoxious) data sets into usable data. Power BI is leading the way with integrated features that make cleaning and normalizing your data a much less infuriating task than it once was. Instead of a grueling Spartan marathon, data transformation can now be a fun Sudoku puzzle.

Power Query and the Query Editor

Power BI’s Query Editor is such a versatile tool that I could probably write several articles detailing various aspects of its use, but for the purpose of this discussion I’m going to outline some of the ways that the Query Editor can be used to transform and normalize data. Aside from its built-in transformation tools, the Query Editor allows for the use of R and SQL in addition to the Microsoft languages M, DAX, and MDX.

Data transformation begins during the import stage. Power Query and the Query Editor contain features that allow you to join files, change data types and formats, and remove columns during data import, and the program will even detect some changes and apply them automatically. The ability to join multiple files that contain similar data (e.g., sales reports over several months) during import is an incredible efficiency tool. All of the transformation that occurs at this stage can be performed on multiple files at once, saving you from repeating the process for each separate file and joining the results in a later step, as the sketch below illustrates.
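As a rough illustration (the folder path and column names here are hypothetical, not from any particular data set), this is the kind of M that Power Query generates behind the scenes when you combine a folder of monthly CSV exports and set data types during import:

    let
        // Read every file in a (hypothetical) folder of monthly sales exports
        Source = Folder.Files("C:\Data\MonthlySales"),
        CsvOnly = Table.SelectRows(Source, each [Extension] = ".csv"),
        // Parse each CSV and promote its first row to headers
        Parsed = Table.AddColumn(CsvOnly, "Data", each
            Table.PromoteHeaders(Csv.Document([Content], [Delimiter = ",", Encoding = 65001]), [PromoteAllScalars = true])),
        // Stack all of the files into one table and set data types once, for everything
        Combined = Table.Combine(Parsed[Data]),
        Typed = Table.TransformColumnTypes(Combined, {{"OrderDate", type date}, {"Amount", type number}})
    in
        Typed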

Once you’ve imported your data, the Query Editor provides numerous options for data transformation from the built-in menus. Using only the menu tools available, you can promote your first row to headers, remove or add columns, split columns (by delimiter, size, or number of characters), pivot and unpivot data, and group and dedupe your data; the sketch below shows the kind of M those menu steps record. Microsoft has a great overview of the Query Editor functionality on the Power BI Blog.
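Every click in those menus is captured as an applied step. As a hedged sketch (the workbook, sheet, and column names are invented for illustration), a handful of menu-driven steps might look like this in the Advanced Editor:

    let
        Source = Excel.Workbook(File.Contents("C:\Data\Survey.xlsx"), null, true),
        Sheet = Source{[Item = "Responses", Kind = "Sheet"]}[Data],
        // Use First Row as Headers
        Promoted = Table.PromoteHeaders(Sheet, [PromoteAllScalars = true]),
        // Split Column by Delimiter
        Split = Table.SplitColumn(Promoted, "FullName", Splitter.SplitTextByDelimiter(" "), {"FirstName", "LastName"}),
        // Remove Columns
        Removed = Table.RemoveColumns(Split, {"Notes"}),
        // Unpivot Other Columns
        Unpivoted = Table.UnpivotOtherColumns(Removed, {"FirstName", "LastName"}, "Question", "Answer"),
        // Remove Duplicates
        Deduped = Table.Distinct(Unpivoted)
    in
        Deduped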

Die-hard Excel users will already be familiar with the basic functionality available through Power Query in Excel; it’s virtually identical in Power BI, making the transition from Excel to Power BI seamless. For a great overview tutorial on using Power Query and the Query Editor, check out this video (the demonstration is shown in Excel, but again, the experience is very similar in Power BI).

Give me an M

If you want to go beyond the functionality of the built-in tools, you’ll need to learn M, the language formerly known as the Power Query Formula Language. Most users will only use M indirectly via the Query Editor, but the code can be viewed and edited via the Advanced Editor. M is a mashup query language. According to the MSDN Power Query M Reference page, “The Power Query M formula language is optimized for building highly flexible data mashup queries.” Learning M is worth the effort for any Power BI user, as it greatly expands your data wrangling and transformation capabilities. But don’t take my word for it. At the June 2017 Microsoft Data Insights Summit in Seattle, Matt Masson explained how to use M to create a custom data connector, and Paul Turley and Brian Grant discussed advanced data transformation using M. If they can’t sell you on the benefits of learning M, nothing I can say will sway you.
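To give a flavor of what M can do beyond the menus (a minimal sketch of my own, not taken from those talks), here’s a small reusable function that trims, cleans, and capitalizes a text column in one pass; the sample table and column name are made up:

    let
        // A reusable function: tidy up a messy text column in a single step
        CleanText = (t as table, column as text) as table =>
            Table.TransformColumns(t, {{column, each Text.Proper(Text.Clean(Text.Trim(_))), type text}}),
        // Hypothetical source table with an untidy "City" column
        Source = #table({"City"}, {{"  chicago "}, {"new york  "}, {"SEATTLE"}}),
        Result = CleanText(Source, "City")
    in
        Result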

For a thorough introduction to M, I recommend another lecture by Matt Masson, Deep Dive into Power Query Formula Language. (He references versions of this talk in the previous video.) Once you’ve explored the basics, the Power Query M Reference page offers a comprehensive overview of the language, including a function reference and a downloadable PDF of the Power Query Language Specification.


Can you use Power BI without SQL?

For the most part, the only place a non-technical person will use SQL in Power BI is when retrieving data from a SQL Server database. Power Query allows a SQL statement to be applied to the data being retrieved from the server, but most of the magic of linking tables can be performed via the Navigator screen once the database is connected; the statements Power BI creates are written in M and can be viewed and edited using the Advanced Editor. There may be advantages to retrieving data via a SQL statement, especially if you’re working with complex table structures. Blogger Chris Webb explains that the one major drawback to importing via a SQL statement is that all subsequent transformations take place only in Power BI and don’t make use of query folding to act on the source data. There may be some more interesting applications for SQL in Power BI, though I’ll admit that I haven’t played with it myself, as the built-in functions of the Query Editor are so well designed.
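For illustration only (the server, database, and table names are invented), this is roughly what the M looks like when you paste a SQL statement into the SQL Server connector; anything you do after that first step runs in Power BI rather than folding back to the server, which is the trade-off Chris Webb describes:

    let
        // Connect to a (hypothetical) SQL Server database and pass a native SQL statement
        Source = Sql.Database("sqlserver01", "SalesDW",
            [Query = "SELECT o.OrderID, o.OrderDate, c.Region, o.Amount FROM dbo.Orders o JOIN dbo.Customers c ON c.CustomerID = o.CustomerID WHERE o.OrderDate >= '2017-01-01'"]),
        // Subsequent transformations happen in Power BI, not on the server
        Filtered = Table.SelectRows(Source, each [Amount] > 0)
    in
        Filtered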

R – It’s Not Just an Expression Used by Pirates

R is a powerful statistical language. It’s available as a free, open-source download that can be used standalone (within RStudio or R Client, for example) or within other environments, including Power BI. R users have contributed a wealth of packages that are invaluable to analysts and can be used within Power BI to manipulate and transform incomplete or inconsistent data into usable data sets. This cheat sheet from RStudio shows some of the ways that R can be used to clean, reformat, summarize, group, and join data. (RStudio provides cheat sheets for several other functions as well!)

If you’re new to R, don’t worry; you’ll have no problem finding a plethora of free information on the language and its use. A few great resources to start with are the Microsoft R Application Network, The R Project for Statistical Computing, and R-Bloggers. If you find that you want a deeper dive into R, Salvatore S. Mangiafico published an e-book detailing the use of R for statistical analysis entitled Summary and Analysis of Extension Program Evaluation in R.

One thing to note: Power BI supports running R scripts within the Query Editor, but it doesn’t ship with R out of the box; you’ll need to download and install R on your computer before use. You can download R from MRAN, the Microsoft R Application Network, or CRAN, the Comprehensive R Archive Network; R Client is available for Windows and Linux, and a free version of RStudio is available on their website.
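When you click Run R Script in the Query Editor, Power BI wraps your R code in an M step via R.Execute. Here’s a minimal sketch (the sample table and the na.omit cleanup are just placeholders for illustration):

    let
        // Hypothetical query with a missing value in the "Sales" column
        Source = #table({"Region", "Sales"}, {{"East", 100}, {"West", null}, {"North", 250}}),
        // Power BI hands the current table to R as a data frame named "dataset"
        RunRScript = R.Execute("output <- na.omit(dataset)", [dataset = Source]),
        // R.Execute returns the script's data frames by name; pull out "output"
        Output = RunRScript{[Name = "output"]}[Value]
    in
        Output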

Conclusion

Dirty data is an inevitable part of any analyst’s daily life. A necessary part of the work of data analytics is mining diamonds from piles of garbage, which is to say, extracting the useful data from mountains of largely irrelevant information and then transforming that data into something that can be analyzed and visualized. We all know how challenging, rewarding, and often frustrating this part of the work can be. While the world around us is slowly being outfitted with data infrastructures that collect data in usable formats, many outdated data collection tools and systems remain. Until data collection catches up to the demands of current analytical capabilities, we need to find ways to work within the existing data landscape. Power BI offers the next generation of data transformation capabilities, helping to take the tedium out of finding actionable insights in the vast oceans of data now readily available.
