Part 1: Pipeline Automation Overview

Want to put your knowledge from our workflow guide into practice? Follow this tutorial to implement a fully automated workflow that runs a sentiment analysis on tweets, using our GitHub workflow template.

Objectives of this tutorial

  • Familiarize yourself with a robust directory structure for data-intensive projects
  • Experience the benefits of automating workflows with makefiles/GNU make
  • Learn to use Git templates for your own research projects
  • Adjust the workflow template to
    - ...download different datasets from the web
    - ...unzip data automatically
    - ...parse JSON objects and select relevant attributes
    - ...add new text mining metrics to the final data set using Python's textblob
    - ...modify the analysis in an RMarkdown/html document
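To give a feel for the unzip and JSON-parsing objectives above, here is a minimal sketch using only Python's standard library. It is not the code from the workflow template: the file name `tweets.json`, the newline-delimited JSON layout, and the field names `id`, `text`, and `lang` are assumptions made for illustration, and the demo archive is built in memory so that nothing needs to be downloaded.

```python
import io
import json
import zipfile


def select_attributes(zip_bytes, member, fields):
    """Read a newline-delimited JSON file from a zip archive and keep only `fields`."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as handle:
            for raw_line in handle:
                line = raw_line.decode("utf-8").strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                # Missing attributes become None instead of raising an error.
                rows.append({field: record.get(field) for field in fields})
    return rows


def make_demo_zip():
    """Build a tiny in-memory archive (hypothetical data, not the tutorial's dataset)."""
    tweets = [
        {"id": 1, "text": "I love this tutorial", "lang": "en"},
        {"id": 2, "text": "Make is confusing", "lang": "en"},
    ]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("tweets.json", "\n".join(json.dumps(t) for t in tweets))
    return buffer.getvalue()


if __name__ == "__main__":
    for row in select_attributes(make_demo_zip(), "tweets.json", ["id", "text"]):
        print(row)
```

In the tutorial itself, steps like these live in separate scripts that are chained together by make; the sketch only shows the kind of logic each step contains.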

Prerequisites

  • Computer setup following our setup instructions.
    - Python and the textblob package

        <div class="highlight"><pre class="chroma"><code class="language-fallback" data-lang="fallback">pip install -U textblob</code></pre></div>
    
        Then, open Python in the terminal by typing `python`, and type
    
        <div class="highlight"><pre class="chroma"><code class="language-fallback" data-lang="fallback">import nltk
        nltk.download('punkt')</code></pre></div>
    
        If you receive an error message, please verify that you are typing this command in Python, and not *directly* in the terminal/Anaconda prompt.
    
    - <a href="/install/r" alt="R, RStudio">R, RStudio</a> and the following packages:
    
        <div class="highlight"><pre class="chroma"><code class="language-fallback" data-lang="fallback">install.packages(c("data.table", "knitr", "Rcpp", "ggplot2", "rmarkdown"))</code></pre></div>
    
        When installing the packages, R may ask you to select a "CRAN mirror". This is the location of the package repository from which R downloads the packages. Either pick `0-Cloud`, or manually choose the location nearest to you.
    
    Warning

    **R 4.0**.
    Newer versions of R (R >= 4.0) may require you to install additional packages:
    
    
    install.packages(c("rlang", "pillar"))
    - If you are asked whether to build these packages from source [options: yes/no], select NO.
    - If you are asked to install RTools, please follow these installation instructions.

  • Familiarity with our workflows, in particular with pipelines and project components, directory structure, and pipeline automation.

  • Nice-to-haves:
    - Basic experience with Python and R
    - Familiarity with common data operations using data.table in R
    - Familiarity with text mining using Python and TextBlob
    - If you want to learn Git on the way...
        - Have Git installed on your computer (see here)
        - Have GitHub login credentials
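If you are unsure whether the Python prerequisites listed above are in place, a quick check along the following lines may help. This is a sketch, not part of the workflow template: the helper name `missing_modules` is hypothetical, and the check only confirms that the `textblob` and `nltk` packages can be found. It does not verify that the `punkt` data has been downloaded.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    required = ["textblob", "nltk"]  # Python packages named in the prerequisites
    missing = missing_modules(required)
    if missing:
        print("Missing Python packages:", ", ".join(missing))
    else:
        print("All required Python packages were found.")
```

Save the snippet to a file and run it with `python` from the terminal. If anything is reported as missing, repeat the corresponding installation step.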

Disclaimer

To keep this tutorial as accessible as possible, we mention Git/GitHub a few times but assume you will learn these skills elsewhere. In other words, versioning and contributing to Git repositories are not part of this tutorial.