Big Data can be useless without multi-layer data aggregations and hierarchical or cube-like intermediary data structures, because only a few dozen, a few hundred or a few thousand data points are exposed visually and dynamically at any single viewing moment to analytical eyes for interactive drill-down or drill-up hunting for business value(s) and actionable datum (or “datums”, if the plural means data). One of the best expressions of this concept (at least as I interpreted it) I heard from my new colleague, who flatly said:
“Move the function to the data!”
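To make that slogan concrete, here is a minimal sketch of how I read it. It assumes an in-memory SQLite database standing in for the real back-end, and the `sales` table with its `region` and `amount` columns is purely hypothetical. The same totals are computed twice: once by pulling every detail row to the client, and once by letting the database aggregate and return only the summary.

```python
import sqlite3

# Hypothetical stand-in for a large back-end table (assumption: SQLite in memory).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
rows = [(f"region_{i % 5}", float(i)) for i in range(10_000)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# "Move the data to the function": every detail row travels to the client.
detail = conn.execute("SELECT region, amount FROM sales").fetchall()
client_totals = {}
for region, amount in detail:
    client_totals[region] = client_totals.get(region, 0.0) + amount

# "Move the function to the data": the database aggregates, only totals travel.
summary = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
pushed_totals = dict(summary)

print(f"{len(detail)} rows moved vs {len(summary)} rows moved")
```

Both approaches give the same totals, but the second one ships a handful of rows instead of the whole table, which is the point of the aggregation layers mentioned above.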
I recently got involved with multiple projects using large data sets for Tableau-based Data Visualizations (100+ million rows and even billions of records!). The largest examples I used were one with 800+ million records and another with 2+ billion rows.
So this blog post expresses my thoughts about such Big Data as a back-end for Tableau. On average, the examples above have about 1+ KB per CSV record before compression and other advanced DB tricks, like the columnar storage used by the Tableau Data Engine.
Here are some factors involved in data delivery from the main, designated database (back-ends like Teradata, DB2, SQL Server or Oracle) into “local” Tableau-based Big Data Visualizations; keep in mind that many people are still trying to use Tableau as a Reporting tool as opposed to a (Visual) Analytical Tool:
- Queuing of thousands of queries at the Database Server. There is no guarantee your Tableau query will be executed immediately; in fact it WILL be delayed.
- The speed of a Tableau query, once it starts executing, depends on sharing CPU cycles, RAM and other resources with the other queries executed SIMULTANEOUSLY with yours.
- Buffers, pools and other resources available to particular users and queries at your Database Server differ and depend on the privileges and settings given to you as a Database User.
- Network speed: between some servers it can be 10 Gbit/s (or even more), and in most cases it is 1 Gbit/s inside server rooms. Outside of server rooms, in many old buildings I observed (over wired Ethernet) at most 100 Mbit/s coming into a user’s PC; if you are using Wi-Fi it can be even less (say 54 Mbit/s?). Over the internet it can be lower still (in some remote offices I observed about 1 Mbit/s or so over old T-1 lines); if you are using a VPN it will max out at 4 Mbit/s or less (I observed this in my home office).
- Utilization of the network. I use Remote Desktop Protocol (RDP) from my workstation or notebook to a VM (a Virtual Machine or VDI, sitting in the server room) that is connected to the servers at a network speed of 1 Gbit/s, but it still uses at most 3% of that speed (about 30 Mbit/s, which is roughly 3 to 4 megabytes of data per second, or probably a few thousand records per second).
That means the network may have a problem delivering 100 million records to a “local” report overnight (say within 10 hours, which requires 10 million records per hour, or about 3,000 records per second), partially and probably because of the network factors above.
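The back-of-the-envelope arithmetic above can be sketched in a few lines. This is only a rough estimate under stated assumptions: a 1 Gbit/s link used at 3%, exactly 1 KB (1024 bytes) per record, and no protocol, driver or query overhead at all. The function names are mine, for illustration only.

```python
def records_per_second(link_mbits, utilization, record_bytes):
    """Records per second that fit through the used share of a link."""
    bytes_per_second = link_mbits * 1_000_000 * utilization / 8
    return bytes_per_second / record_bytes


def hours_to_deliver(total_records, rate_per_second):
    """Hours needed to move total_records at a steady rate."""
    return total_records / rate_per_second / 3600


# Assumptions: 1 Gbit/s link, 3% utilization, 1 KB per record.
rate = records_per_second(1000, 0.03, 1024)
hours = hours_to_deliver(100_000_000, rate)
print(f"{rate:,.0f} records/s, {hours:.1f} hours for 100 million records")
```

Even in this idealized case the transfer eats most of a 10-hour overnight window, and any queuing or resource sharing on the database side pushes it past that window.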
On top of those factors, please keep in mind that Tableau is a set of 32-bit applications (with the exception of one out of 7 processes on the Server side), each of which is restricted to 2 GB of RAM. If the data set cannot fit into RAM, then the Tableau Data Engine will use the disk as virtual RAM, which is much, much slower, and for some users that disk space is not even local to their workstation but mapped to some “remote” network file server.
Tableau Desktop in many cases uses 32-bit ODBC drivers, which may add even more delay to data delivery into the local “Visual Report”. As we learned from Tableau support itself, even with the latest Tableau Server 7.0.X, the RAM allocated for one user session is restricted to 3 GB anyway.
Unfortunate Update: Tableau 8.0 will be a 32-bit application again, but maybe a follow-up version, 8.x or 9 (I hope), will be ported to 64 bits… It means that Spotfire, Qlikview and even PowerPivot will keep some advantages over Tableau for a while…