Why Learn Data Engineering?

Scholars of International Relations (IR) should learn the basics of data engineering, including cloud computing ecosystems. The guide below serves as a “quick start” to get anyone up to speed rapidly. There is far more to the field than this introductory primer covers. The primer is intended for graduate students in IR, but it may be useful to anyone who can set aside some time to develop new methodological skills.

Why bother learning data engineering on the cloud? This is the rationale:
1. The best publications make use of novel datasets.
2a. Novel datasets are difficult to gather, process, and analyze; in practice, you end up being less a social scientist than a data engineer.
2b. Let me re-emphasize: much of academia is about data engineering, not just data analysis. You will likely spend much more time and energy constructing a database than analyzing it. Data engineering is about constructing an automated system that obtains the data, cleans it, and has it ready for analysis.
3. Data analysis is always taught; data engineering is rarely taught.
4. The best way to reduce the labor-intensive work of database creation is to leverage a large suite of advanced computational tools. These tools include ways to create automated systems connected to data pipelines; e.g., a script that makes several thousand API calls to a social media interface and then automatically cleans, stores, and continuously analyzes the results.
5. It is unlikely that any one individual could ever buy the full suite of these tools.
6. Good news: Some companies offer these advanced computational tools for rent.
7. These “landlords” of advanced compute are known as “cloud computing providers”. They include Amazon Web Services (AWS), Google Cloud Platform (GCP), etc.
8. These providers have a myriad of different tools; getting wise to them is traditionally a slog.
9. You might be better off if you had a guide to at least get you started.
10. This page is meant to be a very streamlined guide of that kind. It contains only a set of videos, and it does not cover the topic comprehensively. But it should allow any IR scholar to get a foot in the door and begin thinking about engineering, not just analysis.
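
To make the pipeline idea in point 4 concrete, here is a minimal sketch in Python of the four stages: obtain, clean, store, analyze. It is an illustration under stated assumptions, not a recommended design: the function names are hypothetical, the API step is replaced by hard-coded sample records, and a local SQLite database stands in for cloud storage.

```python
import sqlite3
from collections import Counter


def fetch_posts():
    """Stand-in for the API step (assumption): a real pipeline would call an HTTP API here."""
    return [
        {"id": 1, "text": "  Sanctions ANNOUNCED today "},
        {"id": 2, "text": "Trade talks resume"},
        {"id": 2, "text": "Trade talks resume"},  # duplicate record
        {"id": 3, "text": ""},                    # empty record
    ]


def clean(posts):
    """Normalize whitespace and case; drop empty and duplicate records."""
    seen = set()
    cleaned = []
    for post in posts:
        text = " ".join(post["text"].split()).lower()
        if not text or post["id"] in seen:
            continue
        seen.add(post["id"])
        cleaned.append({"id": post["id"], "text": text})
    return cleaned


def store(conn, posts):
    """Write cleaned records to a table, replacing rows that share an id."""
    conn.execute("CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, text TEXT)")
    conn.executemany("INSERT OR REPLACE INTO posts (id, text) VALUES (:id, :text)", posts)
    conn.commit()


def analyze(conn):
    """Toy analysis: count how often each word appears across stored posts."""
    counts = Counter()
    for (text,) in conn.execute("SELECT text FROM posts"):
        counts.update(text.split())
    return counts


def run_pipeline(conn):
    """Run every stage in order; scheduling this function is what makes it 'automatic'."""
    store(conn, clean(fetch_posts()))
    return analyze(conn)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(connection).most_common(3))
```

Each stage is a separate function so that any one of them can be swapped out later, for example by replacing the SQLite step with a cloud storage service, without touching the rest.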

Step 1: Basics

First, watch these videos to get apprised of the topic.
1. Skim this from IBM. Link.
2. What does an actual, non-academic data engineer do? A day in the life (DITL) from Seattle Data Guy on YouTube. Link.
3. Pipelines are what matter. Link.
4. Quick-ish overview from FreeCodeCamp.org. Link.

Second, dive into the basics.
1. Johnny Chivers on YouTube has an excellent starter video. Link.
2. (Optional) from FreeCodeCamp.org. This is more generally about the AWS ecosystem, not necessarily specific to data engineering. Link.
3. Codecademy has a user-friendly course on data engineering. It is relatively light on the cloud computing components. Link.
4. What is everything under the sun? AWS cheat sheet by ByteByteGo. Link.
5. ELK stacks in AWS. Link.

Third, build your own capstone mini project. For better or for worse, the best way to learn is to do it yourself. Just be mindful of costs.
1. Pick a topic you like. Maybe it’s related to something you picked up in the INT Hub.
2. First, get the data sorted into an S3 bucket. This is how an S3 bucket works: link.
3. Next, use AWS Glue to process the data. This is how Glue works: link.
4. (Optional, recommended) use a Lambda function to automatically process the data, including any new data added to the pipeline. This is how a Lambda function works: link.
5. (Optional) Train a machine learning model using AWS SageMaker. Once your data is processed, you can take it a step further by using AWS SageMaker to build, train, and deploy machine learning models. Link.
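
As a rough sketch of what the Lambda step might look like, the Python below reads each newly uploaded S3 object named in the triggering event, strips blank lines and stray whitespace, and writes the result back under a `clean/` prefix. Several things here are assumptions rather than part of this guide: the `raw/` and `clean/` prefixes, the `clean_text` rule, and the optional `s3_client` argument (added only so the function can be tried locally without AWS credentials). It also assumes the S3 trigger is restricted to the `raw/` prefix, since otherwise the function's own output would trigger it again.

```python
from urllib.parse import unquote_plus


def objects_from_event(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (a space becomes '+').
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def clean_text(raw):
    """Hypothetical cleaning rule: trim each line and drop empty ones."""
    lines = [line.strip() for line in raw.splitlines()]
    return "\n".join(line for line in lines if line)


def lambda_handler(event, context, s3_client=None):
    """Entry point that Lambda calls with (event, context)."""
    if s3_client is None:
        import boto3  # imported lazily so the helpers above run without AWS installed
        s3_client = boto3.client("s3")

    processed = []
    for bucket, key in objects_from_event(event):
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        cleaned = clean_text(body.decode("utf-8"))
        out_key = "clean/" + key.split("/")[-1]
        s3_client.put_object(Bucket=bucket, Key=out_key, Body=cleaned.encode("utf-8"))
        processed.append(out_key)
    return {"processed": processed}
```

Because every object fetched and every invocation is billed, a small test bucket with a handful of files is a sensible place to try this before pointing it at a full dataset.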

Step 2: Advanced

Warning: everything below is only for advanced and specialized cases. It is relevant only if you are curious about how to increase automation and efficiency when managing gargantuan datasets from different sources, combined with your own specialized cloud tools for collecting data. To put it more bluntly: proceeding may not be worth the cost unless you are actively going to construct a sensing instrument yourself, e.g., a device that collects raw, unprocessed data from the wild. Even then, the steps above might be sufficient.

1. FreeCodeCamp.org again. Link.
2. TechWorld with Nana videos. Link and link.
3. Azure certificate on cybersecurity architecture; AWS and Google Cloud will have their own as well. Link.