Data infrastructure can be configured in many different ways, and the right setup depends on whether you are in the cloud, on premises, or in a hybrid environment. Here is an example infrastructure in the cloud.
These components are listed in reverse order, starting from the end user's perspective:
- Data visualization tools (e.g., Tableau, MicroStrategy)
- A SQL interface (command line, SQuirreL SQL, Toad, Mode Analytics)
- A notebook server for data scientists and power users. This could be used for data science experimentation or for light pre/post-ETL work.
- Data warehouse (Redshift, Snowflake). This is where you store your mostly curated and purposefully modeled data.
- ETL (Extract, Transform, and Load) tools to move data from source systems to the data lake, to the data warehouse, and on to BI tools (Informatica, Python, Ab Initio)
- Scheduling tools to automate ETL, reporting, and data science operations (UC4)
- A data lake that serves as both the landing area for all new data arriving from your source systems and the historical archive of source data. This is also where you will stage and run your ETL processes. (Databricks, Cloudera, Spark, Hive, Hadoop)
- All of these tools run on server clusters of varying size, depending on their use cases and performance needs.
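To make the ETL step above concrete, here is a minimal, hedged sketch of the extract-transform-load pattern in plain Python. It is purely illustrative: the inline CSV string stands in for raw files landing in a data lake, and an in-memory SQLite database stands in for the warehouse. Real pipelines would use tools like the ones listed above (Informatica, Spark, etc.), and the table and column names here are made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw data as it might land in the data lake.
RAW_CSV = """order_id,amount,region
1,19.99,us-east
2,5.00,eu-west
3,42.50,us-east
"""

def extract(raw: str) -> list[dict]:
    """Extract: parse raw CSV rows from the landing area."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows: list[dict]) -> list[tuple]:
    """Transform: cast types and aggregate revenue per region."""
    totals: dict[str, float] = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + float(row["amount"])
    return sorted(totals.items())

def load(conn: sqlite3.Connection, records: list[tuple]) -> None:
    """Load: write the curated, modeled result into the warehouse."""
    conn.execute("CREATE TABLE revenue_by_region (region TEXT, revenue REAL)")
    conn.executemany("INSERT INTO revenue_by_region VALUES (?, ?)", records)

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
print(conn.execute("SELECT * FROM revenue_by_region ORDER BY region").fetchall())
```

In a production setup, a scheduler (like the UC4 example above) would kick off this kind of job on a schedule, and the warehouse table would then feed the SQL interface and visualization tools.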
Did I get everything?