Anova – Data Analytics
Abstract
The field of data analytics enables decision makers to better understand data, detect trends, explore relationships and derive insights from historical data that support efficient future decisions. Data analytics is the process of analyzing a specific data set through various manipulations in order to draw meaningful conclusions.
The aim of this paper is to present some of the more important statistical techniques and highlight their uses based on two case studies in accidents and crime data.
Introduction
Statistics applies mathematical disciplines to the analysis of data. Most domains use data analysis to plan, analyze and interpret experimental work or historical data, especially for critical topics such as crime and accidents. Various existing studies have explored this type of data and looked at the association between variables of interest to aid the interpretation and presentation of results. Using the appropriate statistical method is essential to assess and make inferences in these critical topics; for example, the evaluation of the variation of traffic volume and number of accidents per hour, or the evaluation of the impact of road characteristics on accident severity.
Various statistical techniques and methods have been proposed in the field of data analysis. Descriptive statistics and inferential statistics are the two main branches of statistics employed in the scientific analysis of data, and both are equally important for researchers.
Descriptive statistics summarize data from a population sample using indicators such as the mean, median, minimum and maximum values, standard deviation and/or coefficient of variation. This type of statistics deals with the capture and presentation of data and often constitutes the first part of a statistical analysis. At this step the statistician needs to pay attention to the design of the experiment, choose the right focus group and avoid biases.
The mean, or more precisely the arithmetic mean, is simply the arithmetic average of a group of numbers (or data set) and is written with a bar over the variable name. So the mean of the variable $x$ is $\bar{x}$, pronounced “x-bar”. It is calculated by adding up all of the values in a data set and dividing by the number of values in that data set.
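As a minimal sketch of the descriptive indicators listed above, the following R snippet computes them on a small made-up vector. The vector `vehicles` and its values are hypothetical and are not taken from the case study data.

```r
# Hypothetical sample: vehicles involved in eight accidents (made-up values)
vehicles <- c(1, 2, 2, 2, 3, 1, 2, 4)

n <- length(vehicles)
x_bar <- sum(vehicles) / n   # arithmetic mean: sum of values / number of values

# The hand-computed mean agrees with the built-in mean()
stopifnot(isTRUE(all.equal(x_bar, mean(vehicles))))

descriptives <- c(
  mean   = x_bar,
  median = median(vehicles),
  min    = min(vehicles),
  max    = max(vehicles),
  sd     = sd(vehicles),          # sample standard deviation
  cv     = sd(vehicles) / x_bar   # coefficient of variation
)
print(round(descriptives, 3))
```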
Inferential statistics, on the other hand, allow us to draw conclusions about the population from the statistical analysis performed with descriptive statistics. Among the most widely used techniques are hypothesis testing and ANOVA.
Two characteristics of the data sets must be considered before applying any inferential test: (1) do the data conform to the principles of normality, i.e. are they normally distributed (ND)? (2) do the data satisfy the assumption of homoscedasticity, i.e. uniformity of variance?
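One possible way to check these two assumptions in R is sketched below. It uses the built-in `PlantGrowth` data set rather than the case study data, and the choice of the Shapiro-Wilk and Bartlett tests is an assumption of this sketch; the paper does not state which checks were run.

```r
# Built-in PlantGrowth data: plant weight for three treatment groups
data(PlantGrowth)

# 1. Normality: Shapiro-Wilk test within each group (H0: data are normal)
normality_p <- tapply(PlantGrowth$weight, PlantGrowth$group,
                      function(w) shapiro.test(w)$p.value)

# 2. Homoscedasticity: Bartlett test (H0: equal variances across groups)
variance_p <- bartlett.test(weight ~ group, data = PlantGrowth)$p.value

print(normality_p)
print(variance_p)
```

Note that shapiro.test() accepts at most 5000 observations, so for data sets as large as those in the case studies a random subsample or a Q-Q plot would be needed instead.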
Most predictions of the future and generalizations of results from a smaller sample to the overall population come under the umbrella of inferential statistics. By designing the right experiment and using the right techniques, the researcher is able to draw conclusions relevant to the study.
Hypothesis testing: a statistical hypothesis is an assumption made by the researcher about the population from which the data of an experiment are collected. This assumption can be true or false. Hypothesis testing is the formal way of checking whether the data give enough evidence to reject the hypothesis made by the researcher.
Validating a hypothesis with certainty would require examining the entire population, which is not possible in practice. Instead, we use random samples from the population and, on the basis of the test result on the sample data, we either reject or fail to reject the hypothesis.
A statistical hypothesis can be categorized into 2 types as below:
- Null Hypothesis – Hypothesis tests are used to test the validity of a claim that is made about a population. This claim that’s on trial, in essence, is called the null hypothesis. The null hypothesis is denoted by H0.
- Alternative Hypothesis – The alternative hypothesis is the one you would believe if the null hypothesis is concluded to be untrue. The evidence in the trial is your data and the statistics that go along with it. The alternative hypothesis is denoted by H1 or Ha.
Hypothesis testing uses a p-value to evaluate the strength of the evidence. The p-value is a number between 0 and 1 and interpreted in the following way:
- A small p-value (typically ≤0.05) indicates strong evidence against the null hypothesis, so you reject it.
- A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it.
- A p-value very close to the cutoff (0.05) is considered to be marginal and could go either way.
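The decision rule above can be sketched as a small R helper. The function name `decide` and the default `alpha = 0.05` are illustrative choices for this sketch, not part of any library.

```r
# Decision rule for a hypothesis test at significance level alpha
# (decide() is a hypothetical helper written for this illustration)
decide <- function(p_value, alpha = 0.05) {
  stopifnot(p_value >= 0, p_value <= 1)
  if (p_value <= alpha) "reject H0" else "fail to reject H0"
}

print(decide(0.001))  # small p-value: strong evidence against H0
print(decide(0.30))   # large p-value: weak evidence against H0
```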
The Analysis of Variance (ANOVA)
Analysis of variance (ANOVA) is a parametric statistical technique that partitions the observed variance into components that arise from different sources of variation. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are all equal. In this sense, the null hypothesis, H0, says there are no differences among the means of the different treatments or sample sets; the alternative hypothesis (Ha) is that the means do differ. If the null hypothesis is rejected, the alternative hypothesis, Ha, is accepted, i.e. at least one group mean differs from the others.
The ANOVA technique is generally used to compare the mean values of three or more data sets.
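A minimal one-way ANOVA in R might look like the sketch below. It uses the built-in `PlantGrowth` data set (three groups) purely for illustration; it is not the data analyzed in the case studies.

```r
# One-way ANOVA on the built-in PlantGrowth data (three treatment groups)
data(PlantGrowth)

fit <- aov(weight ~ group, data = PlantGrowth)
anova_table <- summary(fit)[[1]]        # data frame with Df, Sum Sq, Mean Sq, ...

f_value <- anova_table[["F value"]][1]  # F statistic of the group term
p_value <- anova_table[["Pr(>F)"]][1]   # p-value of the global comparison

print(anova_table)
if (p_value <= 0.05) {
  message("Reject H0: at least one group mean differs")
} else {
  message("Fail to reject H0: no evidence that the group means differ")
}
```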
This study focuses on the three statistical methods noted above, which are frequently used in this field, and applies them in two different case studies.
The purpose of this document is to present and apply three statistical techniques, then discuss the outcome of applying them to our datasets.
The remainder of this paper presents two case studies. Each case study presents the dataset structure, applies at least 2 statistical techniques to it, and discusses the results of using these techniques and whether one is better supported than the others.
1. Case study 1 : UK accidents data
A road accident is the most unwanted thing that can happen to a road user. Speeding, red light jumping, or neglecting safety gear such as seat belts and helmets are some of the common behaviors that cause fatal accidents and increase the severity of an accident.
The United States is one of the busiest countries in terms of road traffic. The level of traffic is one of the reasons leading to more traffic accidents: in 2015, some 6.3 million fatal, injury, and property damage crashes occurred in the U.S. alone.
The UK government amassed traffic data between 2000 and 2016, recording over 1.6 million accidents in the process and making this one of the most comprehensive traffic data sets out there. It’s a huge picture of a country undergoing change.
Data analysis of accident data is the process of analyzing accident data sets to derive useful conclusions and/or information, following the approach presented previously. We propose to analyze the impact of junction control types on the number of vehicles involved in an accident, using the explanatory variable Junction_Control, and to check whether the mean differs between junction control types. This case study uses three statistical techniques: mean analysis, hypothesis testing and Analysis of Variance (ANOVA).
The purpose of this case study is to check whether there is a statistically significant difference in the variable “number of vehicles” between two groups of “junction control”, “Giveway or uncontrolled” and “Stop Sign”, in the accidents data.
- Null Hypothesis (H0) : Junction control of type “Giveway or uncontrolled” has the same impact on the number of vehicles as Junction control of type “Stop Sign”
- Alternative Hypothesis : Junction control of type “Giveway or uncontrolled” does not have the same impact on the number of vehicles as Junction control of type “Stop Sign”
1.1. Data Preparation and Cleaning
All the accident data come from police reports, so the data do not include minor incidents. The “UK road accidents from 2012 to 2014” file contains data on UK accidents from 2012 to 2014.
Figure 1 : snapshot of accident data
Data variables:
- Unnamed: 0
- Accident_Index
- Location_Easting_OSGR
- Location_Northing_OSGR
- Longitude
- Latitude
- Police_Force
- Accident_Severity
- Number_of_Vehicles
- Number_of_Casualties
- Date
- Day_of_Week
- Time
- Local_Authority_(District)
- Local_Authority_(Highway)
- 1st_Road_Class
- 1st_Road_Number
- Road_Type
- Speed_limit
- Junction_Detail
- Junction_Control
- 2nd_Road_Class
- 2nd_Road_Number
- Pedestrian_Crossing-Human_Control
- Pedestrian_Crossing-Physical_Facilities
- Light_Conditions
- Weather_Conditions
- Road_Surface_Conditions
- Special_Conditions_at_Site
- Carriageway_Hazards
- Urban_or_Rural_Area
- Did_Police_Officer_Attend_Scene_of_Accident
- LSOA_of_Accident_Location
- Year
The data analysis is conducted using the R programming language. The main R libraries used for this analysis are ggplot2, readr and data.table.
After loading the main libraries, the R script checks the main statistics of each variable using the summary function. The results show that some variables and rows contain missing data, which requires a cleaning operation that removes the rows with NA values.
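The inspection and cleaning step can be sketched on a tiny stand-in table. The data frame `accidents_demo` and its values are hypothetical; only the column names follow the variable list above, and the use of na.omit mirrors the call shown later for the crime data.

```r
# Hypothetical miniature of the accidents table (values are made up)
accidents_demo <- data.frame(
  Number_of_Vehicles = c(2, 1, NA, 3, 2),
  Junction_Control   = c("Stop Sign", NA, "Giveway or uncontrolled",
                         "Giveway or uncontrolled", "Stop Sign"),
  stringsAsFactors   = TRUE
)

print(summary(accidents_demo))          # summary() reports NA counts per column
print(colSums(is.na(accidents_demo)))   # number of missing values per column

accidents_clean <- na.omit(accidents_demo)   # drop every row containing an NA
print(nrow(accidents_clean))
```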
1.2. Exploratory data analysis
Junction_Control is a categorical variable with 286087 non-missing values in 4 categories, as presented in the table below:
summary(accidents1214$Junction_Control)
Junction control | Giveway or uncontrolled | Automatic traffic signal | Stop Sign | Authorised person | NA |
Number | 232915 | 50208 | 2287 | 677 | 286087 |
Percentage | 40.7% | 8.8% | 0.4% | 0.1% | 50.0% |
The 286087 NA values will be removed from the data in the next sections. The “Giveway or uncontrolled” group has the highest number of accidents with 40.7%; the “Authorised person” group has the lowest with 0.1%, followed by the “Stop Sign” group with 0.4%.
Figure 2 : box plot of number of vehicles by Junction control groups
Figure 2 presents the side-by-side box plot of the number of vehicles by junction control. A box plot summarizes a data set using important statistics (first quartile, median, third quartile and upper whisker) for comparing data across groups. This figure illustrates the distribution of the number of vehicles in each group and shows that the “Giveway or uncontrolled” group reaches a larger number of impacted vehicles (a maximum of 11 vehicles) than the other groups. The “Authorised person” group reaches a maximum of 5 impacted vehicles.
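The statistics that a box plot draws can be inspected directly with boxplot.stats(). The sketch below uses a made-up vector, not the accidents data, to show how whiskers and outliers are derived.

```r
# Hypothetical vehicle counts, including one extreme value (made-up data)
vehicles <- c(1, 1, 2, 2, 2, 2, 3, 4, 11)

stats <- boxplot.stats(vehicles)

# stats$stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$stats)

# stats$out: points lying more than 1.5 times the hinge spread beyond the box
print(stats$out)
```

With the default coefficient of 1.5, the value 11 falls outside the upper whisker and would be drawn as an isolated point.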
Figure 3 : Number of vehicles histogram
Figure 3 presents the histogram of the number of vehicles, which illustrates the distribution of the variable Number_of_Vehicles. Most accidents involve 2 vehicles (198105 accidents), followed by 1 vehicle (65251 accidents). We also note the presence of some outliers (59 accidents), including an accident involving 11 vehicles in the “Giveway or uncontrolled” group.
1.3. Descriptive statistics using Mean
We propose to analyze the mean of Number_of_Vehicles using the independent variable Junction_Control and check whether the mean differs between junction control types, as presented in the table below.
# Mean of Number_of_Vehicles per group (assumed definition, mirroring the data.table aggregation used in case study 2)
meanJunction_Control2 = accidents1214[, mean(Number_of_Vehicles), by=Junction_Control]
meanJunction_Control2
# Create a bar plot of the Number_of_Vehicles
barplot(meanJunction_Control2$V1, ylim=c(0,3), names.arg=meanJunction_Control2$Junction_Control, main="Number_of_Vehicles for Different Junction_Control", xlab="Junction_Control", ylab="Average Number_of_Vehicles (out of 3)", xpd=FALSE)
Figure 4 : Number of vehicles per Junction control
Table 1 : mean and standard deviation of number of vehicles
Estimate parameter | Missing (NA) | Authorised person | Automatic traffic signal | Giveway or uncontrolled | Stop Sign |
Mean | 1.762813 | 1.748892 | 1.857214 | 1.870863 | 1.953214 |
Standard deviation | 0.8570673 | 0.6404876 | 0.5964850 | 0.5938531 | 0.5324114 |
From Figure 4 and Table 1, the calculated means of “number of vehicles” are 1.87 and 1.95 for the “Giveway or uncontrolled” group and the “Stop Sign” group respectively. The mean of the “Stop Sign” group is slightly larger than that of the “Giveway or uncontrolled” group. Even though the maximum number of impacted vehicles is highest in the “Giveway or uncontrolled” group, the mean of the “Stop Sign” group is the highest.
1.4. Hypothesis Testing
Hypothesis testing is the formal way of checking whether the data give enough evidence to reject the null hypothesis defined above. Hypothesis testing uses a p-value to evaluate the strength of the evidence.
This section’s purpose is to check whether there is a statistically significant difference in the number of vehicles between junction control of type “Giveway or uncontrolled” and junction control of type “Stop Sign” in the accidents data:
- Null Hypothesis (H0) : Junction control of type “Giveway or uncontrolled” has the same impact on the number of vehicles as Junction control of type “Stop Sign”
- Alternative Hypothesis : Junction control of type “Giveway or uncontrolled” does not have the same impact on the number of vehicles as Junction control of type “Stop Sign”
Group | Giveway or uncontrolled | Stop Sign |
Number | 232915 | 2287 |
Percentage | 40.7% | 0.4% |
Applying Student’s t-test in R to the accidents data
The Student’s t-test is a method for comparing two samples. It can be used to determine whether the means of the samples differ. This is a parametric test, and the data should be normally distributed. R handles the various versions of the t-test with the t.test() command, which covers one- and two-sample tests as well as paired tests.
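```{r}
# Using the Student's t-test in R
# Keep only the two junction control types being compared (data.table subsetting)
accidents1214_2types = accidents1214[Junction_Control == 'Giveway or uncontrolled' | Junction_Control == 'Stop Sign']
# Two-sample t-test of Number_of_Vehicles by junction control type
t.test(accidents1214_2types$Number_of_Vehicles ~ accidents1214_2types$Junction_Control)
```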
Results :
Welch Two Sample t-test
data: accidents1214_2types$Number_of_Vehicles by accidents1214_2types$Junction_Control
t = -7.3522, df = 2342.2, p-value = 2.681e-13
The p-value of the t-test for the comparison of the mean number of vehicles involved in “Giveway or uncontrolled” and “Stop Sign” accidents is less than the specified α-level (0.05). Hence the mean numbers of impacted vehicles in “Giveway or uncontrolled” and “Stop Sign” are significantly different.
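As a cross-check, the reported Welch statistic can be recomputed from the group means and standard deviations in Table 1 and the group counts given above. This is a sketch using the standard Welch formulas on the published summary values, not a rerun on the raw data; the variable names are illustrative.

```r
# Summary values copied from Table 1 and the group counts above
m1 <- 1.870863; s1 <- 0.5938531; n1 <- 232915   # Giveway or uncontrolled
m2 <- 1.953214; s2 <- 0.5324114; n2 <- 2287     # Stop Sign

v1 <- s1^2 / n1   # squared standard error of the first group mean
v2 <- s2^2 / n2   # squared standard error of the second group mean

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (m1 - m2) / sqrt(v1 + v2)
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df)

print(c(t = t_stat, df = df, p = p_value))
```

The values obtained this way agree with the reported t = -7.3522 and df = 2342.2 up to rounding of the inputs.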
1.5. ANOVA
This section’s purpose is to check, using the ANOVA technique, whether there is a statistically significant difference in the number of vehicles across the junction control types, including “Giveway or uncontrolled” and “Stop Sign”, in the accidents data.
The p-value of the analysis of variance for the number of vehicles across the different kinds of junction control is less than our α-level of significance (0.05). The analysis therefore concludes that the type of junction control has a significant effect on the number of impacted vehicles.
# Fit the one-way ANOVA model
l <- aov(Number_of_Vehicles ~ Junction_Control, data = accidents1214)
# ANOVA table, effect sizes per group, and diagnostic plots
summary(l)
model.tables(l)
plot(l)
Results :
Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
Junction_Control | 3 | 34 | 11.254 | 31.9 | <2e-16 *** |
Residuals | 286083 | 100928 | 0.353 | | |
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Tables of effects
Junction_Control | Authorised person | Automatic traffic signal | Giveway or uncontrolled | Stop Sign |
Effect | -0.1199 | -1.162e-02 | 2.026e-03 | 8.438e-02 |
rep | 677.0000 | 5.021e+04 | 2.329e+05 | 2.287e+03 |
The p-value (<2e-16) is lower than 0.05. The result of the global comparison is thus obtained: the effect of Junction_Control is significant at the 0.001 threshold (F = 31.9, as reported in the ANOVA table).
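The F value in the ANOVA table is the ratio of the between-group mean square to the residual mean square. The sketch below recomputes it, together with its p-value, from the rounded figures printed in the results; the variable names are illustrative.

```r
# Mean squares and degrees of freedom copied from the ANOVA table above
ms_between <- 11.254; df_between <- 3        # Junction_Control
ms_within  <- 0.353;  df_within  <- 286083   # Residuals

# F statistic: between-group variance relative to within-group variance
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(F = f_value, p = p_value))
```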
The model.tables(l) function returns the effect sizes, i.e., here, the differences between each group mean and the overall mean. (Note that this result could also have been obtained with the tapply function.)
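The equivalence between model.tables and tapply can be illustrated on the built-in `PlantGrowth` data set, used here only as a stand-in for the accidents data. The sketch assumes a one-way model, where each effect is the group mean minus the overall mean.

```r
data(PlantGrowth)
fit <- aov(weight ~ group, data = PlantGrowth)

# Effects as reported by model.tables()
effects_aov <- model.tables(fit, type = "effects")$tables$group

# The same effects by hand: group means minus the overall mean
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
effects_manual <- group_means - mean(PlantGrowth$weight)

print(effects_aov)
print(round(effects_manual, 4))
```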
The plot(l) function provides different graphs for diagnosing the validity of the model (homogeneity of intra-group variances, normal distribution of residuals, etc.). Among them, we illustrate below the Residuals versus Fitted values plot and the Normal Q-Q plot.
plot(l)
Figure (unnumbered) : Residuals versus Fitted values plot and Normal Q-Q plot
Discussion
The aim of this first part of the study was to identify the effect of junction control on the number of vehicles involved in road accidents in the UK, based on data obtained from police records of the accident cases. The methods used for data analysis were hypothesis testing using the t-test and ANOVA.
Both methods, hypothesis testing using the t-test and ANOVA, confirm that the number of impacted vehicles in road traffic accidents is associated with the type of junction control, especially between “Giveway or uncontrolled” and “Stop Sign”. The “Stop Sign” group mean is slightly greater than that of the “Giveway or uncontrolled” group. Even though the maximum number of impacted vehicles is highest in the “Giveway or uncontrolled” group, the mean of the “Stop Sign” group is still the highest.
Conclusion
The overall findings of this study indicate that the number of impacted vehicles in road traffic accidents is associated with the type of junction control, especially between “Giveway or uncontrolled” and “Stop Sign”. Even though the “Giveway or uncontrolled” group reaches the highest number of impacted vehicles, the mean of the “Stop Sign” group is still slightly the highest. The t-test and ANOVA methods used in this case study have revealed a difference in means between these groups; however, the type and the degree of impact are not clear. Other statistical techniques, such as regression or decision trees, can help to identify the type and degree of impact.
1. Case study 2 : US crime data
1.1. Data preparation and cleaning
The “Crime_Data_2010_2017.csv” file contains data on crimes from 2010 to 2017, with 26 variables and 1584316 observations. Below is a snapshot of the selected dataset.
The data analysis of the second case study is also conducted using the R programming language. The main R libraries used for this analysis are ggplot2, readr and data.table. To explore the data and check the main statistics of each variable, we used the R summary function.
summary(crimes)
DR Number | Date.Reported | Date.Occurred | Time.Occurred |
Min. : 210 | 08/07/2017: 804 | 01/01/2010: 2135 | Min. : 1 |
1st Qu.:112117618 | 07/24/2017: 782 | 01/01/2011: 2050 | 1st Qu.: 930 |
Median :140109488 | 07/05/2016: 760 | 01/01/2012: 1641 | Median :1430 |
Mean :135807195 | 07/17/2017: 757 | 01/01/2013: 1455 | Mean : 1364 |
3rd Qu.:152018611 | 05/22/2017: 754 | 01/01/2014: 1319 | 3rd Qu.:1900 |
Max. :910220366 | 06/05/2017: 749 | 01/01/2015: 1240 | Max. :2359 |
(Other) :1579710 | (Other) :1574476 |
Area.ID | Area.Name | Reporting.District | Crime.Code |
Min. : 1.00 | 77th Street: 110605 | Min. : 100 | Min. :110 |
1st Qu.: 6.00 | Southwest : 102259 | 1st Qu.: 645 | 1st Qu.:330 |
Median :12.00 | N Hollywood: 86405 | Median :1204 | Median :440 |
Mean :11.15 | Pacific : 83763 | Mean :1162 | Mean :507 |
3rd Qu.:16.00 | Southeast : 83517 | 3rd Qu.:1676 | 3rd Qu.:626 |
Max. :21.00 | Mission : 80249 | Max. :2198 | Max. :956 |
(Other) :1037518 |
Crime.Code.Description | MO.Codes | Victim.Age |
BATTERY – SIMPLE ASSAULT:145767 | 0344 : 173902 | Min. :10.00 |
VEHICLE – STOLEN :121329 | 171759 | 1st Qu.:23.00 |
BURGLARY FROM VEHICLE :121318 | 0329 : 68728 | Median :34.00 |
BURGLARY :114751 | 1501 : 34504 | Mean :35.93 |
THEFT PLAIN – PETTY ($950 & UNDER) : 113709 | 0416 : 23997 | 3rd Qu.:48.00 |
THEFT OF IDENTITY :100653 | 0325 : 21546 | Max. :99.00 |
(Other) :866789 | (Other) :866789 | NA’s :128659 |
Victim.Sex | Victim.Descent | Premise.Code |
:145199 | H :549515 | Min. :101.0 |
-: 1 | W :391855 | 1st Qu.:102.0 |
F:675402 | B :255056 | Median :210.0 |
H: 53 | O :152776 | Mean :312.4 |
M:739581 | :145232 | 3rd Qu.:501.0 |
X: 24080 | X : 41535 | Max. :971.0 |
(Other): 48347 | NA’s :76 |
Premise.Description | Weapon.Used.Code |
STREET :352160 | Min. :101.0 |
SINGLE FAMILY DWELLING :328198 | 1st Qu.:400.0 |
MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC):204980 | Median :400.0 |
PARKING LOT :112576 | Mean :370.6 |
SIDEWALK : 79247 | 3rd Qu.:400.0 |
OTHER BUSINESS : 71097 | Max. :516.0 |
(Other) :436058 | NA’s :1059559 |
Weapon.Description | Status.Code |
:1059560 | IC :1227180 |
STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE): 319818 | AO : 178175 |
VERBAL THREAT : 43814 | AA : 162424 |
UNKNOWN WEAPON/OTHER WEAPON : 40746 | JA : 12619 |
HAND GUN : 25352 | JO : 3889 |
SEMI-AUTOMATIC PISTOL : 10096 | CC : 24 |
(Other) : 84930 | (Other) : 5 |
Status.Description | Crime.Code.1 | Crime.Code.2 | Crime.Code.3 |
Adult Arrest: 162424 | Min. : 110.0 | Min. : 121.0 | Min. : 93.0 |
Adult Other : 178175 | 1st Qu. : 330.0 | 1st Qu.: 998.0 | 1st Qu. : 998.0 |
Invest Cont :1227180 | Median : 440.0 | Median : 998.0 | Median : 998.0 |
Juv Arrest : 12619 | Mean : 506.9 | Mean : 954.5 | Mean : 970.7 |
Juv Other : 3889 | 3rd Qu.: 626.0 | 3rd Qu.:998.0 | 3rd Qu. : 998.0 |
UNK : 29 | Max. : 999.0 | Max. : 999.0 | Max. : 999.0 |
NA’s : 7 | NA’s : 1484319 | NA’s : 1582133 |
Crime.Code.4 | Address |
Min. : 421.0 | 6TH ST : 3692 |
1st Qu.: 998.0 | 7TH ST : 2793 |
Median : 998.0 | 9300 TAMPA AV : 2724 |
Mean : 967.2 | 6TH : 2231 |
3rd Qu.: 998.0 | 5TH ST : 2230 |
Max. : 999.0 | 6600 TOPANGA CANYON BL: 2226 |
NA’s : 1584247 | (Other) : 1568420 |
Cross.Street | Location |
:1321583 | (0, 0) : 5482 |
BROADWAY : 4662 | (34.1016, -118.3387): 2915 |
FIGUEROA : 2830 | (34.1905, -118.6059) : 2202 |
VERMONT AV: 2776 | (33.9892, -118.3089) : 1881 |
WESTERN AV: 2641 | (34.1576, -118.438) : 1681 |
SAN PEDRO : 2616 | (34.2216, -118.4488) : 1627 |
(Other) : 247208 | (Other) : 1568528 |
After loading the main libraries, the R script checks the main statistics of each variable using the summary function. The results show that some variables and rows contain missing data, which requires a cleaning operation that removes the rows with NA values using the crimes = na.omit(crimes) script.
1.2. Exploratory data analysis
Victim.Age is a discrete numeric variable which gives the age of the victim in each crime. The main statistics of this variable are presented in the table below. The median victim age is 34 years, and the mean, 35.93, is slightly greater than the median. The minimum value is 10 years old and the maximum value is 99 years old.
summary(crimes$Victim.Age)
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
10.00 | 23.00 | 34.00 | 35.93 | 48.00 | 99.00 |
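table(crimes$Victim.Age)
Age | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
Count | 24852 | 25812 | 25698 | 27806 | 32816 | 38223 | 31918 | 12842 | 17211 | 22570 | 26895 | 30078 | 32671 | 35131 |
Age | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 |
Count | 36096 | 36939 | 35693 | 35097 | 34517 | 34323 | 33686 | 31644 | 31194 | 30281 | 30022 | 30367 | 28006 | 26623 |
Age | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 |
Count | 26082 | 25900 | 26863 | 24747 | 24753 | 24689 | 23981 | 24228 | 23969 | 23484 | 22566 | 22166 | 22852 | 21231 |
Age | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 |
Count | 20859 | 20077 | 18965 | 18244 | 17260 | 15972 | 14878 | 14111 | 12809 | 11575 | 10904 | 9948 | 9146 | 8575 |
Age | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 |
Count | 7551 | 6915 | 6043 | 5742 | 5130 | 4388 | 4165 | 3606 | 3257 | 3208 | 2724 | 2535 | 2366 | 2178 |
Age | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 |
Count | 2082 | 1754 | 1639 | 1484 | 1374 | 1214 | 1071 | 923 | 861 | 696 | 620 | 481 | 379 | 343 |
Age | 94 | 95 | 96 | 97 | 98 | 99 |
Count | 257 | 183 | 126 | 98 | 66 | 353 |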
Moreover, if we look at the frequency table of the discrete variable “victim age”, we find that 15 years old is the most frequent age, with 38223 crimes in this population.
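Reading the most frequent value off a frequency table can also be done programmatically. The sketch below uses a short made-up vector `ages` instead of crimes$Victim.Age.

```r
# Hypothetical victim ages (made-up values)
ages <- c(15, 25, 15, 34, 15, 25, 48)

age_counts <- table(ages)   # frequency of each distinct age

# which.max() returns the position of the largest count; its name is the age
top_age <- as.integer(names(age_counts)[which.max(age_counts)])
top_count <- max(age_counts)

print(age_counts)
print(c(top_age = top_age, top_count = top_count))
```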
Figure 6 : frequency of victim.age variable
We plot the Victim.Age frequencies in figure 6 to observe the distribution of the data and the mean. The histogram of the Victim.Age data shows a bimodal, right-skewed distribution rather than a clean normal shape, with peaks in the 10 and 25 year intervals.
For the Area.Name categorical variable, the frequency table presents 21 categories, the names of the areas covered by this dataset. The top area, “77th Street”, has 105752 crimes, more than any of the other areas:
table(crimes$Area.Name)
77th Street | Central | Devonshire | Foothill | Harbor | Hollenbeck | Hollywood |
105752 | 62170 | 67001 | 55248 | 65438 | 49886 | 64378 |
Mission | N Hollywood | Newton | Northeast | Olympic | Pacific | Rampart |
73425 | 83070 | 66062 | 68282 | 60997 | 74260 | 63765 |
Southeast | Southwest | Topanga | Van Nuys | West LA | West Valley | Wilshire |
79269 | 97839 | 69635 | 67869 | 61861 | 59357 | 60093 |
We propose to plot the Victim.Age distribution for each Area.Name and observe the distribution of the data and the means.
Figure 7 : distribution of victim.age in area names
Figure 7 illustrates the distribution of victim ages by area name using side-by-side box plots. The figure shows a clear difference in victim age between areas; the “77th Street” area has the highest number of crimes.
1.3. Descriptive statistics using Mean
We propose to analyze the mean of the variable victim age using the explanatory variable area.name and check whether the mean differs between areas.
# Mean victim age per area (data.table aggregation)
mean_c = crimes[, mean(Victim.Age), by=Area.Name]
mean_c
# Create a bar plot
barplot(mean_c$V1, ylim=c(10,60), names.arg=mean_c$Area.Name, main="victim age for Different area", xlab="areas", ylab="Average victim age", xpd=FALSE)
Figure 8 : victim age means for different area names
Figure 8 illustrates, using a bar plot, the mean victim age by area name. “West LA” has the highest mean age with 40.51 years and “Newton” has the lowest with 32.12 years.
# Same group means computed with tapply (named mean_by_area to avoid masking the mean function)
mean_by_area <- tapply(crimes$Victim.Age, crimes$Area.Name, mean)
mean_by_area
barplot(mean_by_area)
This observed mean difference in victim age may be explained by other variables or by demographic data of the area. In the next sections, the document presents the application of the t-test and ANOVA methods to analyze the impact of the area of the crime on victim age.
1.4. Hypothesis testing
This section’s purpose is to check whether there is a statistically significant difference in victim.age between the crime area “Newton” and the crime area “West LA” in the crime data:
Figure 9: side by side box plot of victim age for 2 areas of crime
Figure 9 shows the side-by-side box plot of victim age for the “Newton” area and the “West LA” area. This graph presents a difference in means between the two areas: “West LA” has the highest mean of 40.51 years versus 32.12 years for “Newton”.
- Null Hypothesis (H0) : the group with the area name “West LA” has the same mean victim age as the group with the area name “Newton”
- Alternative Hypothesis : the group with the area name “West LA” does not have the same mean victim age as the group with the area name “Newton”
The p-value of the t-test for the comparison of mean victim age between “Newton” and “West LA” is less than the specified α-level (0.05). Hence the difference in victim age between the two areas is significant.
1.5. ANOVA
This section’s purpose is to check, using the ANOVA technique, whether there is a statistically significant difference in the victim.age variable between the areas.
- Null Hypothesis (H0) : all area name groups, including “West LA” and “Newton”, have the same mean victim age
- Alternative Hypothesis : at least one area name group has a different mean victim age
The p-value of the analysis of variance for victim age across the different area names is lower than our α-level of significance (0.05). The analysis therefore concludes that the area name has a significant effect on victim age.
The result of the global comparison is thus obtained: the effect of the area.name variable on victim age is significant (F = 7957).
The model.tables(l) function returns the effect sizes, i.e., here, the differences between each area name mean and the overall mean.
The plot(l) function provides different graphs for diagnosing the validity of the model (homogeneity of intra-group variances, normal distribution of residuals, etc.).
plot(l)
Discussion
The aim of this second part of the study was to identify the effect of the area name on victim age in the crime dataset, based on data obtained from police records of the crime cases. The methods used for data analysis were hypothesis testing using the t-test and ANOVA. The results of the t-test and ANOVA confirm that there is a statistically significant difference in victim.age between areas. The “West LA” area has the highest mean victim age with 40.51 years and “Newton” has the lowest with 32.12 years.
Conclusion
The results show that there is a statistically significant difference in mean victim age between areas, especially between “West LA” and “Newton”. The age of crime victims is related to the characteristics of the area, which can be studied from different perspectives. A study of the demographic structure, the population or the economic conditions of the areas can help to reveal the root cause of this relation and, from there, the factors that influence victim age.
Referencing
http://www.sthda.com/french/wiki/anova-analyse-de-variance-avec-r
https://www.kaggle.com/cityofLA/crime-in-los-angeles
https://www.kaggle.com/daveianhickey/2000-16-traffic-flow-england-scotland-wales
https://explorable.com/branches-of-statistics
http://www.acaresiran.com/Content/Articles/17/ArticleFiles/17/File.pdf