Wednesday, April 3, 2019

Using RStudio to Prepare and Clean Data

There is now more data available than ever before, and its depth and scope are increasing daily. The explosion of the internet and connected devices has accelerated this, and big data is now big business. As the amount of data available to us has grown, so has the need to analyse it. Many companies use this data to predict future trends. The tools we use to analyse and present this data in a meaningful way have also changed.

In the past, statistical software was very expensive and often had no graphical capabilities. Enter the R programming language, a tool that supports both statistics and graphics. It was first released in 1995, with the first stable build in 2000, and is now on version 3, which was released in 2013. R is a free, open source project with over 7,000 add-on packages available. Many companies, such as Google and Facebook, use R for their data analysis.

In this lab book we will look at cleaning and preparing data so that it can be analysed. We will use RStudio, which is an IDE (integrated development environment) for the R programming language. RStudio is available in open source and commercial versions. It has two editions, RStudio Desktop and RStudio Server, and runs on Windows, macOS and Linux.

The dataset we have is from the UK government and covers MOT outlets in England, Scotland and Wales. It contains data such as names, addresses, postcodes, telephone numbers and the categories of vehicles tested. A quick look at the dataset shows a lot of blank fields, extra white space, typos in the telephone column, and second telephone numbers separated by the / symbol. Using RStudio we will attempt to tidy and clean the dataset.
In this lab book we will explain the various commands and techniques used to prepare the data for analysis.

Make a copy of the data to work with

Method: Here we create a copy of the original dataset x2016motsitelist and call it MotList. This is good practice, as you will not contaminate the original dataset.

Result: From the above screenshot you can see that our copy of the dataset is named MotList. Typing the name of the dataset in RStudio prints the dataset to the console.

Get the structure of our data frame

Method: By using the str() command in the console we get the structure of our data.

Result: The str() command shows that our dataset has 22,980 observations of 14 different variables. The lines that follow, each beginning with $, give the column headings and display some of the values contained in these columns. In effect, this command just lists the components of the data frame with their names.

View the data

Method: We use the head command to view the data.

Result: Using this command, the first 6 records are displayed in the console window.

Identify the names of columns

Method: We use the names command to display the column names.

Result: This displays the names of our columns in the console window.

Summary of what is contained in the columns

Method: We use the summary command to get an overview of the data in our columns.

Result: The summary command gives us an overview of every vector in the data frame. In our case it tells us that the length of each is 22,980 rows and that all vectors are of class character.

Missing values

Method: We use the is.na command on its own, then is.na combined with the any command, and lastly is.na combined with the sum command to count the missing values in the data.

Result: The is.na command returns a Boolean TRUE or FALSE for each value in the dataset, telling us whether a missing value is present or not. With the any command we find that there is indeed missing data in the dataset. With the sum command we get the number of missing values, which is 149,097 in this case.

Rename columns in our dataset

Method: We use the colnames command to change the names of the columns in our dataset that are numbered 1, 2, 3, 4, 5 and 7.

Result: With the above commands we change the names of the columns, using the existing name to identify which column to apply the change to. We use names(MotList) to verify the result.

Remove NA from the different categories of vehicle that are MOT tested

Method: We create another copy of our dataset and call it MotListMod. On this dataset we change the NA values in the columns that we renamed earlier, so that the different categories of vehicles tested have complete values and no missing data. We do this by giving the dataset name and then the $ column name. We then use the which command and is.na to change the value to the desired result.

Result: As can be seen from the screenshot above, we have changed the NA values in the six columns of our dataset. Our dataset now tells us whether a MOT test centre carries out tests on the different vehicle categories, Y or N, whereas before it only told us if the centre did (Y), with a blank field for N. Again, we run sum with is.na on both datasets. The MotListMod dataset now has far fewer NAs.

Remove and tidy up the VTS Telephone column

Method: Firstly, using the gsub command we remove instances of Tel. and TEL. from our column. Secondly, we split the column into two sections, number 1 and number 2, with the separate command, as some of the test centres have two telephone numbers separated by / in the dataset. Thirdly, we tidy up the white space.

Test: Using gsub wrongly in the screenshot above did not give the desired outcome, but in the two screens below we get the desired outcome.
The above screen shows where the VTS Telephone column is split into different sections.
Trimming white space from the front of the telephone numbers.
Removing the NAs from the VTS Telephone number2.

Result: By using gsub and identifying the column we wanted to target, we replaced the instances of Tel. and TEL. in our dataset with whitespace. We then split the column into two different sections. This created a lot of NAs in the second column, because not every test centre has two telephone numbers, so to counteract this we replaced the NAs with the value 0. We then tidied up the white space at the start of the two columns.

Write to CSV file in RStudio

Method: We write the MotListMod3 dataset to a CSV file with the write.csv command.

Result: The above command writes the dataset to a CSV file, which can be viewed or shared with others. See the above screenshot of the file in Excel.

Outliers and plot function

Method: Using the hist command we produce a histogram of the cars column. The column's class had to be changed to factor to make the function work. We also used the table command to count the number of Y and N values in this column.

Test: In the screenshot above you can see a histogram of the cars column.

Result: No outliers are present, as the columns for the different types of vehicles tested only contain Y or N. Also, our data was of class character, and this had to be converted to a factor so that we could use the histogram function on the cars column. We used the table command on the column to display a numeric result: N = 1,054 and Y = 21,926.
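
The copy and inspection steps described in this lab book (copying the dataset, then str, head, names, summary and the is.na checks) can be sketched in base R roughly as follows. The small data frame is an invented stand-in for x2016motsitelist. Its values and the "VTS Name" column are assumptions made for illustration, not the real MOT data.

```r
# Invented stand-in for the MOT site list (hypothetical values and columns).
x2016motsitelist <- data.frame(
  `VTS Name` = c("Alpha Garage", "Beta Motors", "Gamma Autos"),
  `VTS Telephone` = c("Tel. 01234 567890", "01111 222333 / 01111 444555", NA),
  `1` = c("Y", NA, NA),
  `4` = c("Y", "Y", NA),
  check.names = FALSE,       # keep the column names exactly as written
  stringsAsFactors = FALSE   # keep text columns as character
)

# Work on a copy so the original dataset is left untouched.
MotList <- x2016motsitelist

str(MotList)       # structure: number of rows, columns and their classes
head(MotList)      # first six rows (only three exist here)
names(MotList)     # column names
summary(MotList)   # per-column overview

missing_flags <- is.na(MotList)     # logical matrix, TRUE where a value is missing
has_missing <- any(missing_flags)   # is anything missing at all?
n_missing <- sum(missing_flags)     # how many cells are missing?
print(has_missing)
print(n_missing)
```

On the real dataset the same calls would report 22,980 rows and the much larger count of missing values quoted in the text.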
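
The renaming and NA replacement steps could look something like the sketch below. The new column names (Motorcycles, Cars) are hypothetical, since the text only says that the numbered columns were renamed, and the data frame is again an invented stand-in.

```r
# Invented stand-in with two of the numbered vehicle-class columns.
MotList <- data.frame(
  name = c("Alpha Garage", "Beta Motors", "Gamma Autos"),
  `1` = c("Y", NA, NA),
  `4` = c("Y", "Y", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Rename by matching on the old name, so column position does not matter.
# "Motorcycles" and "Cars" are assumed names chosen for this example.
colnames(MotList)[colnames(MotList) == "1"] <- "Motorcycles"
colnames(MotList)[colnames(MotList) == "4"] <- "Cars"
names(MotList)   # verify the change

# Second copy: replace NA with "N" in the renamed columns.
MotListMod <- MotList
MotListMod$Motorcycles[which(is.na(MotListMod$Motorcycles))] <- "N"
MotListMod$Cars[which(is.na(MotListMod$Cars))] <- "N"

sum(is.na(MotList))      # 3 missing values before
sum(is.na(MotListMod))   # 0 missing values after
```

Keeping MotList unchanged and editing MotListMod makes it easy to compare the missing-value counts before and after.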
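
A minimal sketch of the telephone clean-up is shown below. It is a base R stand-in: strsplit replaces the separate command mentioned in the text, and the prefix is replaced with an empty string rather than whitespace. The output column names Telephone1 and Telephone2 and the sample numbers are assumptions.

```r
# Invented stand-in for the VTS Telephone column.
MotListMod <- data.frame(
  `VTS Telephone` = c("Tel. 01234 567890",
                      "TEL. 01111 222333 / 01111 444555",
                      "01999 888777"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# 1. Strip the "Tel." / "TEL." prefix. fixed = TRUE treats the dot literally;
#    in a regular expression an unescaped "." matches any character.
phone <- MotListMod$`VTS Telephone`
phone <- gsub("Tel.", "", phone, fixed = TRUE)
phone <- gsub("TEL.", "", phone, fixed = TRUE)

# 2. Split on "/" into a first and an optional second number.
parts <- strsplit(phone, "/", fixed = TRUE)
number1 <- vapply(parts, function(p) p[1], character(1))
number2 <- vapply(parts, function(p) p[2], character(1))  # NA when absent

# 3. Trim surrounding white space, then replace the NAs with "0".
MotListMod$Telephone1 <- trimws(number1)
MotListMod$Telephone2 <- trimws(number2)
MotListMod$Telephone2[is.na(MotListMod$Telephone2)] <- "0"

print(MotListMod[, c("Telephone1", "Telephone2")])
```

Storing "0" as text keeps the column of class character, which avoids dropping leading zeros from the real telephone numbers.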
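
The export and counting steps might be sketched as follows. Two things here are assumptions rather than the method used in the lab book: the file is written to a temporary path, and the chart is drawn with barplot on the table of counts, because base R's hist expects numeric input. The data frame is an invented stand-in for MotListMod3.

```r
# Invented stand-in for the cleaned dataset.
MotListMod3 <- data.frame(
  name = c("Alpha Garage", "Beta Motors", "Gamma Autos", "Delta Cars"),
  Cars = c("Y", "Y", "N", "Y"),
  stringsAsFactors = FALSE
)

# Write to a temporary file so the sketch does not clutter the working directory.
csv_path <- tempfile(fileext = ".csv")
write.csv(MotListMod3, csv_path, row.names = FALSE)

# Read the file back to confirm that it round-trips.
round_trip <- read.csv(csv_path, stringsAsFactors = FALSE)

# Convert the character column to a factor and count each level.
MotListMod3$Cars <- factor(MotListMod3$Cars, levels = c("N", "Y"))
car_counts <- table(MotListMod3$Cars)
print(car_counts)

# Draw the chart only in an interactive session such as RStudio.
if (interactive()) {
  barplot(car_counts, main = "Sites testing cars", ylab = "Number of sites")
}
```

With only Y and N as possible values, the table of counts carries the same information as the chart, which is why no outliers can appear in these columns.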
