
title: Advanced Analytics With SparkR In RStudio
author: sandbox-team
tutorial-id: 759
experience: Intermediate
persona: Data Scientist & Analyst
source: Hortonworks
use case: Data Discovery
technology: Apache Spark, Apache SparkR, Machine Learning
release: hdp-3.0.1
environment: Sandbox
product: HDP
series: HDP > Develop with Hadoop > Apache Spark

Advanced Analytics With SparkR In RStudio

Introduction

R is a popular tool for statistics and data analysis. It has rich visualization capabilities and a large collection of libraries that have been developed and maintained by the R developer community. One drawback of R is that it is designed to run on in-memory data, which makes it unsuitable for large datasets.

Spark is a distributed engine for processing many terabytes of data. It is a versatile tool with capabilities for data processing, SQL analysis, streaming and machine learning. Because Spark is a distributed framework, a Hortonworks cluster running Spark can process many terabytes of data in a short amount of time.

SparkR combines the benefits of Spark and R by allowing Spark jobs to be called from within R. This lets the analyst leverage Spark's ability to build aggregate statistics over large, multi-terabyte datasets and then bring the smaller aggregated data back into R for visualization and analysis.

In this tutorial we'll show you how to leverage SparkR to gain insights on airline delays.

Prerequisites

  • Downloaded and deployed the Hortonworks Data Platform (HDP) Sandbox
  • Learning the Ropes of the HDP Sandbox to get used to the Sandbox
  • Follow this article on Hortonworks Community to install RStudio on the Sandbox

Outline

  • Download the Dataset
  • Setup SparkR on RStudio
  • Prepare a Training Dataset
  • Exploratory Data Analysis
  • Summary
  • Further Reading

Download the Dataset

Every year approximately 20% of airline flights are delayed or cancelled, resulting in significant costs to both travelers and airlines. As our example use case, we will build a supervised learning model that predicts airline delay from historical flight data. Download the dataset from here, which includes details about flights in the US for 2015. Every row in the dataset has 16 attributes, including:

  • Year
  • Month
  • Day of Month
  • Day of Week
  • Flight Number
  • Origin
  • Destination
  • Departure Time
  • Departure Delay
  • Arrival Time
  • Arrival Delay
  • Cancelled
  • Cancellation Code
  • Air Time
  • Distance

After you download the file, unzip it and upload train_df.csv and test_df.csv to the /tmp directory of HDFS using Files View. Open Ambari by going to http://sandbox-hdp.hortonworks.com:8080 and use the credentials amy_ds/amy_ds to sign in. Once in Ambari, select "Files View" from the second menu in the upper right hand corner.

[Image: amy-files-view]

Click on the /tmp folder and upload these two files. Your screen should look similar to this:

[Image: Files View 2]

Setup SparkR on RStudio

Next, let us log in to RStudio using the credentials amy_ds/amy_ds. We have to create a SparkContext object which connects the R program to the cluster. You can create it using sparkR.init(). We also need a SqlContext object to work with data frames, which can be created from the SparkContext.

Let us begin by creating an environment variable SPARK_HOME which holds the location of the Spark libraries. We will load the SparkR package and invoke the sparkR.init() function to create the SparkContext. We are also adding some Spark driver properties and the csv package so that the SparkR data frame can read csv files.

Type the following lines on RStudio console:

    Sys.setenv(SPARK_HOME = "/usr/hdp/current/spark2-client")
    library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
    sc <- sparkR.session(master = "local[*]", sparkEnvir = list(spark.driver.memory = "2g"), sparkPackages = "com.databricks:spark-csv_2.10:1.4.0")
    sqlContext <- sparkR.session()

[Image: R1]

Press Enter
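If you want to confirm that the session came up correctly, one quick check (an optional step, not part of the original tutorial) is to print the Spark version from the RStudio console:

    # Optional sanity check: prints the version of Spark the session is connected to
    sparkR.version()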

Prepare a Training Dataset

Before moving ahead, check out the SparkR documentation to get used to the SparkR API. You can create a SparkR dataframe either from a local R data frame, from data sources in formats like csv, or from Hive tables. We are going to use the read.df() function to read the file from a data source (HDFS in our case), retain the schema and create the SparkR data frame. Type this line to create a dataframe from the /tmp/train_df.csv file with headers included:

    train_df <- read.df("/tmp/train_df.csv", "csv", header = "true", inferSchema = "true")

After it is loaded, you can view the dataframe as shown below:
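The command itself is not reproduced in this copy of the tutorial; previewing a SparkR dataframe is typically done with head(), which is presumably what the screenshot below shows:

    # Show the first six rows of the SparkR dataframe
    head(train_df)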

[Image: R2]

It shows the top 6 records of the dataframe; you can also see the variables and their types on the right side of the screen.

[Image: R3]

Next, let us add some more columns to our dataframe to make the data more powerful and informative.

We will begin by deciding whether a given flight falls on a weekend or not. If it is a weekend, enter 1, otherwise 0. The weekend starts on Friday and ends on Sunday. Enter the following:

    train_df$WEEKEND <- ifelse(train_df$DAY_OF_WEEK == 5 | train_df$DAY_OF_WEEK == 6 | train_df$DAY_OF_WEEK == 7, 1, 0)

It uses the ifelse() function, which checks whether the value of the DAY_OF_WEEK variable is 5, 6 or 7 and writes the corresponding value (1 or 0) to the new column WEEKEND.

Next, create a new column called DEP_HOUR which holds the hour value extracted from the DEP_TIME column.

    train_df$DEP_HOUR <- floor(train_df$DEP_TIME / 100)

Now, let us introduce one more column called DELAY_LABELED which has the value 1 if the arrival delay (ARR_DELAY) is more than 15 minutes and 0 if it is less than 15 minutes. That means all flights that arrive more than 15 minutes late are considered delayed.

    train_df$DELAY_LABELED <- ifelse(train_df$ARR_DELAY > 15, 1, 0)
    train_df$DELAY_LABELED <- cast(train_df$DELAY_LABELED, "integer")

We will keep only those flight records where the flight was not cancelled. In the next statement, we filter out the records where the value of CANCELLED is 1:

    train_df <- train_df[train_df$CANCELLED == 0,]

The next cleansing step handles NA values. After looking at the dataset for a while, you will see that there are a lot of NA values in the ARR_DELAY column. We should keep only the rows where we have valid readings of ARR_DELAY.

    train_df <- dropna(train_df, cols = 'ARR_DELAY')
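As an optional check (not part of the original tutorial), you can count how many records remain after removing cancelled flights and NA delays:

    # Number of rows left in the SparkR dataframe after cleansing
    nrow(train_df)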

Next, if you want to know the data types of the columns in the SparkR dataframe, just type:
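The exact command is missing from this copy; in SparkR, either of the following reports each column's data type, and one of them is presumably what the screenshot shows:

    # Print the schema (column names and types) of the SparkR dataframe
    printSchema(train_df)
    # or, as a list of (column, type) pairs
    dtypes(train_df)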

You should see the following console output:

[Image: R4]

We should convert the type of ARR_DELAY and DEP_DELAY from string to integer so that we can perform mathematical operations on them.

    train_df$ARR_DELAY <- cast(train_df$ARR_DELAY, "integer")
    train_df$DEP_DELAY <- cast(train_df$DEP_DELAY, "integer")

Type the command below to view the prepared dataframe:
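The command itself is again missing from this copy; previewing with head() is the likely intent:

    head(train_df)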

[Image: R5]

Exploratory Data Analysis

At the end of this tutorial, we will be able to predict which flights are likely to be delayed. We can classify our dataset into two values: 0 or 1 (0 for flights on time and 1 for flights delayed). But before creating a model, let us visualize the data we have right now.

We create a new dataframe called delay which has two columns, DELAY_LABELED and its count. Basically it holds the counts of delayed and ontime flights. We will use the aggregate function of SparkR, where we group the dataframe by DELAY_LABELED and calculate the count using n().

    delay <- agg(group_by(train_df, train_df$DELAY_LABELED), count = n(train_df$DELAY_LABELED))

Introduce a new column called STATUS which will have the value ontime if DELAY_LABELED is 0 and delayed if DELAY_LABELED is 1.

    delay$STATUS <- ifelse(delay$DELAY_LABELED == 0, "ontime", "delayed")

Delete the DELAY_LABELED column because we do not need it anymore.
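The statement itself is missing from this copy; following the same pattern used later to drop DAY_OF_WEEK_y, it presumably looked like this:

    # Drop the DELAY_LABELED column from the delay dataframe
    delay$DELAY_LABELED <- NULL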

Next, let us convert this SparkR dataframe to an R dataframe using the as.data.frame() function so that we can visualize it with ggplot; let us call this new dataframe delay_r.

    delay_r <- as.data.frame(delay)

Add Percentage as one more column to this new dataframe.

    delay_r$Percentage <- (delay_r$count / sum(delay_r$count)) * 100
    delay_r$Percentage <- round(delay_r$Percentage, 2)

View the dataframe:

[Image: R6]

Next, install and import the package called ggplot2. ggplot2 is a plotting system for R based on the grammar of graphics. You can plot graphs like bar charts, stacked bar charts, line charts, pie charts, scatter plots and histograms.

    install.packages("ggplot2")
    library(ggplot2)

Wait for it to complete. Create a blank theme that removes the axis titles and ticks and sets the size of the plot title.

    blank_theme <- theme_minimal() +
      theme(
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        plot.title = element_text(size = 14, face = "bold")
      )

We will draw a pie chart showing the percentage of delayed and ontime flights.

    ggplot(delay_r, aes(x = "", y = Percentage, fill = STATUS)) +
      geom_bar(stat = "identity", width = 1, colour = "green") +
      coord_polar(theta = "y", start = 0) +
      blank_theme +
      ggtitle("Pie Chart for Flights") +
      theme(axis.text.x = element_blank()) +
      geom_text(aes(y = Percentage / 2, label = paste0(Percentage, "%"), hjust = 2))

[Image: R7]

Click on Zoom at the top of the chart to get a clearer view.

This graph shows that around 18.17% of flights are delayed, which is a very large figure.

Let us explore what effect Day_Of_Week has on delays. We will create two new dataframes called delay_flights and non_delay_flights which hold the details for delayed and ontime flights respectively.

    delay_flights <- filter(train_df, train_df$DELAY_LABELED == 1)
    non_delay_flights <- filter(train_df, train_df$DELAY_LABELED == 0)

Next, we will find the count of delayed and ontime flights grouped by Day_Of_Week.

    delay_flights_count <- agg(group_by(delay_flights, delay_flights$DAY_OF_WEEK), DELAY_COUNT = n(delay_flights$DELAY_LABELED))
    non_delay_flights_count <- agg(group_by(non_delay_flights, non_delay_flights$DAY_OF_WEEK), NON_DELAY_COUNT = n(non_delay_flights$DELAY_LABELED))

Now we can merge the delay_flights_count and non_delay_flights_count dataframes.

    dayofweek_count <- merge(delay_flights_count, non_delay_flights_count, by.delay_flights_count = DAY_OF_WEEK, by.non_delay_flights_count = DAY_OF_WEEK)

When you merge two dataframes, you get the common column twice in the result, which is not required. Let us delete it by typing:

    dayofweek_count$DAY_OF_WEEK_y <- NULL

Rename the column using the withColumnRenamed() function.

    dayofweek_count <- withColumnRenamed(dayofweek_count, "DAY_OF_WEEK_x", "DAY_OF_WEEK")

Convert this SparkR dataframe to an R dataframe so that we can run visualizations on it:

    dayofweek_count_r <- as.data.frame(dayofweek_count)

Let us view this new R dataframe:

[Image: R8]

Introduce two columns, Delayed and Ontime, which hold the percentage values for DELAY_COUNT and NON_DELAY_COUNT respectively.

    dayofweek_count_r$Delayed <- (dayofweek_count_r$DELAY_COUNT / (dayofweek_count_r$DELAY_COUNT + dayofweek_count_r$NON_DELAY_COUNT)) * 100
    dayofweek_count_r$Ontime <- (dayofweek_count_r$NON_DELAY_COUNT / (dayofweek_count_r$DELAY_COUNT + dayofweek_count_r$NON_DELAY_COUNT)) * 100
    dayofweek_count_r <- dayofweek_count_r[, -2:-3]

Next, add one more column which represents the ratio of delayed flights against ontime flights.

    dayofweek_count_r$Ratio <- dayofweek_count_r$Delayed / dayofweek_count_r$Ontime * 100
    dayofweek_count_r$Ratio <- round(dayofweek_count_r$Ratio, 2)

Now, if you look closely, our data is in wide format. Data is said to be in wide format if there is one observation row per subject, with each measurement present as a different variable. We have to change it to long format, which means there is one observation row per measurement and thus multiple rows per subject. In R, we use reshape2 to do this:
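To make the wide-to-long conversion concrete, here is a small illustration (a made-up toy data frame, not from the tutorial) of what melt() does:

    # Toy example: one row per day (wide) becomes one row per measurement (long)
    wide <- data.frame(DAY = c(1, 2), Delayed = c(20, 25), Ontime = c(80, 75))
    long <- reshape2::melt(wide, id.vars = "DAY")
    long
    #   DAY variable value
    # 1   1  Delayed    20
    # 2   2  Delayed    25
    # 3   1   Ontime    80
    # 4   2   Ontime    75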

    library(reshape2)
    DF1 <- melt(dayofweek_count_r, id.var = "DAY_OF_WEEK")
    DF1$Ratio <- DF1[15:21, 3]

View this new long format dataframe:

[Image: R9]

We will change this dataframe a little, just to make the plot clearer.

    DF1 <- DF1[-15:-21,]
    DF1[8:14, 4] <- NA

Next, run the following lines to see the stacked bar chart:

    install.packages("ggrepel")
    library(ggrepel)
    ggplot(DF1, aes(x = DAY_OF_WEEK, y = value, fill = variable)) +
      geom_bar(stat = "identity") +
      geom_path(aes(y = Ratio, color = "Ratio of Delayed flights against Non Delayed Flights")) +
      geom_text_repel(aes(label = Ratio), size = 3) +
      ggtitle("Percentage of Flights Delayed") +
      labs(x = "Day of Week", y = "Percentage")

Click on the Zoom button.

[Image: R10]

As you can see here, most delays happen on Monday and Thursday. The rate drops at the start of the weekend but rises again by Sunday.

Now we will look at the effect of Destination on delays.

Create two new dataframes from the delay_flights and non_delay_flights dataframes respectively, which hold the counts of flights for specific destinations: LAX, SFO, HNL and PDX.

    destination_delay_count <- agg(group_by(delay_flights, delay_flights$DEST), DELAY_COUNT = n(delay_flights$DELAY_LABELED))
    destination_delay_count <- destination_delay_count[(destination_delay_count$DEST == "LAX" | destination_delay_count$DEST == "SFO" | destination_delay_count$DEST == "HNL" | destination_delay_count$DEST == "PDX"),]
    destination_non_delay_count <- agg(group_by(non_delay_flights, non_delay_flights$DEST), NON_DELAY_COUNT = n(non_delay_flights$DELAY_LABELED))
    destination_non_delay_count <- destination_non_delay_count[(destination_non_delay_count$DEST == "LAX" | destination_non_delay_count$DEST == "SFO" | destination_non_delay_count$DEST == "HNL" | destination_non_delay_count$DEST == "PDX"),]

Let's merge these two new dataframes into one.

    destination_count <- merge(destination_delay_count, destination_non_delay_count, by.destination_delay_count = DEST, by.destination_non_delay_count = DEST)
    destination_count$DEST_y <- NULL
    destination_count <- withColumnRenamed(destination_count, "DEST_x", "DEST")

And convert it into an R dataframe.

    destination_count_r <- as.data.frame(destination_count)

Introduce two new columns (Delayed and Ontime) which hold the percentage values:

    destination_count_r$Delayed <- (destination_count_r$DELAY_COUNT / (destination_count_r$DELAY_COUNT + destination_count_r$NON_DELAY_COUNT)) * 100
    destination_count_r$Ontime <- (destination_count_r$NON_DELAY_COUNT / (destination_count_r$DELAY_COUNT + destination_count_r$NON_DELAY_COUNT)) * 100
    destination_count_r <- destination_count_r[, -2:-3]

Introduce one more column called Ratio which holds the proportion of delayed flights against ontime flights for the same four destinations:

    destination_count_r$Ratio <- destination_count_r$Delayed / destination_count_r$Ontime * 100
    destination_count_r$Ratio <- round(destination_count_r$Ratio, 2)

As before, let us melt this dataframe as well to create a stacked bar chart. Use the melt function of the reshape2 package.

    DF2 <- melt(destination_count_r, id.var = "DEST")
    DF2$Ratio <- DF2[9:12, 3]
    DF2 <- DF2[-9:-12,]
    DF2[5:8, 4] <- NA

Draw a stacked bar chart:

    ggplot(DF2, aes(x = DEST, y = value, fill = variable)) +
      geom_bar(stat = "identity") +
      geom_path(aes(y = Ratio, color = "Ratio of Delayed flights against Non Delayed Flights"), group = 1) +
      geom_text_repel(aes(label = Ratio), size = 3) +
      ggtitle("Percentage of Flights Delayed by Destination") +
      labs(x = "Destinations", y = "Percentage")

[Image: R11]

It looks like the smaller-city destination has the highest delay ratio. Let us do the same thing with Origins. Create two new dataframes having records only where the Origin is SNA, ORD, JFK or IAH.

    origin_delay_count <- agg(group_by(delay_flights, delay_flights$ORIGIN), DELAY_COUNT = n(delay_flights$DELAY_LABELED))
    origin_delay_count <- origin_delay_count[(origin_delay_count$ORIGIN == "SNA" | origin_delay_count$ORIGIN == "ORD" | origin_delay_count$ORIGIN == "JFK" | origin_delay_count$ORIGIN == "IAH"),]
    origin_non_delay_count <- agg(group_by(non_delay_flights, non_delay_flights$ORIGIN), NON_DELAY_COUNT = n(non_delay_flights$DELAY_LABELED))
    origin_non_delay_count <- origin_non_delay_count[(origin_non_delay_count$ORIGIN == "SNA" | origin_non_delay_count$ORIGIN == "ORD" | origin_non_delay_count$ORIGIN == "JFK" | origin_non_delay_count$ORIGIN == "IAH"),]

Merge the dataframes using the merge function of the SparkR API and convert the result into an R dataframe:

    origin_count <- merge(origin_delay_count, origin_non_delay_count, by.origin_delay_count = ORIGIN, by.origin_non_delay_count = ORIGIN)
    origin_count$ORIGIN_y <- NULL
    origin_count <- withColumnRenamed(origin_count, "ORIGIN_x", "ORIGIN")
    origin_count_r <- as.data.frame(origin_count)

Add three columns - Delayed (Percentage), Ontime (Percentage) and Ratio (Delayed/Ontime):

    origin_count_r$Delayed <- (origin_count_r$DELAY_COUNT / (origin_count_r$DELAY_COUNT + origin_count_r$NON_DELAY_COUNT)) * 100
    origin_count_r$Ontime <- (origin_count_r$NON_DELAY_COUNT / (origin_count_r$DELAY_COUNT + origin_count_r$NON_DELAY_COUNT)) * 100
    origin_count_r <- origin_count_r[, -2:-3]
    origin_count_r$Ratio <- origin_count_r$Delayed / origin_count_r$Ontime * 100
    origin_count_r$Ratio <- round(origin_count_r$Ratio, 2)

As before, convert the dataframe to long format using melt() and draw the stacked bar chart:

    DF3 <- melt(origin_count_r, id.var = "ORIGIN")
    DF3$Ratio <- DF3[9:12, 3]
    DF3 <- DF3[-9:-12,]
    DF3[5:8, 4] <- NA

    ggplot(DF3, aes(x = ORIGIN, y = value, fill = variable)) +
      geom_bar(stat = "identity") +
      geom_path(aes(y = Ratio, color = "Ratio of Delayed flights against Non Delayed Flights"), group = 1) +
      geom_text_repel(aes(label = Ratio), size = 3) +
      ggtitle("Percentage of Flights Delayed by Origin") +
      labs(x = "Origins", y = "Percentage")

[Image: R12]

As you can see here, the smaller-city origin (SNA) has the lowest delay ratio.

Summary

Congratulations, you now know how to use SparkR to explore, transform and visualize data to gain valuable insight. In the future we will expand this tutorial with prediction steps leveraging several popular supervised learning models.

Further Reading

If you want to learn more about using Apache Spark for machine learning and processing large datasets, please check out these tutorials and videos:

  • Hands-On Tour of Apache Spark in 5 Minutes
  • Intro to Machine Learning with Spark
  • Information Science Starter Kit
  • Installing RStudio on HDP Sandbox


Source: https://github.com/hortonworks/data-tutorials/blob/master/tutorials/hdp/advanced-analytics-with-sparkr-in-rstudio/tutorial.md
