
Data specialists out there, curious to get your take

The company I work at is writing an API call, but rather than doing an extract to a data lake (which the company doesn't have), they are going to transform the data as part of the API call (Python) before piping it into the data warehouse (AWS Redshift).

This team has been driving everyone crazy wanting to know every possible use case for the data, including a bunch of internally defined attributes that aren't part of the raw data source, since they have to write an API script that churns out exactly the finished data table.

This seems like a bad idea to me... as soon as someone wants a new column or wants to revise some historical attribute, someone is going to get stuck rewriting what sounds like a messy API (the last update I got was that they need to make four API calls just to join a basic table before adding our internal custom data). Is that a fair guess as to why each stage of ETL is usually kept separate?
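To make that concrete, here is a rough Python sketch of the two shapes as I understand them. Everything in it is hypothetical (the api_client, the endpoint, the table names, the is_priority rule); the point is only that the second version lands the raw payload untouched, so a new column becomes a SQL change in the warehouse instead of another rewrite of the API script.

import json

# Shape 1: transform inside the API script (roughly what the team is building).
# Every new column or redefined attribute means editing this function.
def load_transformed(api_client, redshift_cursor):
    for row in api_client.get("/orders"):          # hypothetical source endpoint
        finished = {
            "order_id": row["id"],
            "region": row["address"]["region"],
            "is_priority": row["total"] > 500,     # internal attribute baked in here
        }
        redshift_cursor.execute(
            "INSERT INTO analytics.orders (order_id, region, is_priority) VALUES (%s, %s, %s)",
            (finished["order_id"], finished["region"], finished["is_priority"]),
        )

# Shape 2: land the raw payload first, derive attributes later in the warehouse.
# New columns become a SQL/view change; this extract code never has to move.
def load_raw(api_client, redshift_cursor):
    for row in api_client.get("/orders"):
        redshift_cursor.execute(
            "INSERT INTO staging.orders_raw (payload) VALUES (%s)",
            (json.dumps(row),),
        )
    # downstream, e.g.: CREATE VIEW analytics.orders AS SELECT ... FROM staging.orders_raw

In Redshift the raw payload could sit in a SUPER or VARCHAR column and be unpacked with SQL afterwards, which is where requests to revise a historical attribute would get handled.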
Northwest · M
In each system, you have data, and code that interacts with the data.

For instance, you could have a list of authors in one table, using a MySQL database, a list of the books they wrote in another table, using Postgres, another list of sales of these books in a Microsoft SQL Server table, and yet another list of the history of sales in yet another DB technology.

Middleware interfaces with these different tables, across the various DB technologies, and presents them as a single API that can be accessed once the proper security is in place. Usually the language is not important (Python, PHP, C#, etc.), because every client will be using the same protocol to talk to the middleware.

And these APIs are designed to address things like "give me a list of authors", or "give me the sales per author last month", etc.
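As a rough illustration only, here is a minimal middleware sketch using Flask; the query_* functions are stand-ins for whatever drivers actually talk to MySQL, Postgres and SQL Server, and the caller only ever sees the HTTP endpoints, never the databases behind them.

from flask import Flask, jsonify

app = Flask(__name__)

def query_mysql_authors():
    # stand-in for a real query against the authors table in MySQL
    return [{"author_id": 1, "name": "A. Writer"}]

def query_mssql_sales(month):
    # stand-in for a real query against the sales table in SQL Server
    return [{"author_id": 1, "units": 120, "month": month}]

@app.route("/authors")
def list_authors():
    # "give me a list of authors"
    return jsonify(query_mysql_authors())

@app.route("/sales-per-author/<month>")
def sales_per_author(month):
    # "give me the sales per author last month", joined behind the API
    names = {a["author_id"]: a["name"] for a in query_mysql_authors()}
    return jsonify([
        {"author": names.get(s["author_id"], "unknown"), "units": s["units"]}
        for s in query_mssql_sales(month)
    ])

Whether it is Flask, PHP, or C# behind the scenes makes no difference to the caller, which is why the language matters less than the protocol.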

Providing a clear set of requirements helps the developers give you better results.

 