Survival guide

Data exploration

Exercise 1; Loyn.xls

Open the data set and import the data into R. The variable ABUND is the density of birds in 56 forest patches. The explanatory variables are the size of the forest patches (AREA), distance to the nearest forest patch (DIST), distance to the nearest larger forest patch (LDIST), year of isolation of the patch (YR.ISOL), agricultural grazing intensity at each patch (GRAZE) and altitude (ALT). The underlying aim of the research is to find a relationship between bird densities and the explanatory variables. In this exercise we start with a data exploration. Save all your results in a Word document as you will need them later.

1. Identify the response and explanatory variables.
2. There is one nominal explanatory variable. Which one?
3. Are there outliers in the response variable (ABUND)?
4. Are there outliers in the explanatory variables?
5. Is there collinearity?
6. Are there any clear relationships between response and individual explanatory variables?
7. Based on biological common sense, do you think there will be interactions? Explore the data for interactions.
8. What do you think will be the final outcome of this analysis?

Exercise 2; wedgeclamII.xls

Open the data set and import the data into R. The variable AFD is the biomass of 399 wedge clams. It is the response variable. We are going to model it as a function of length (LENGTH) and time (MONTH), which are the explanatory variables. In this exercise we start with a data exploration. Save all your results in a Word document as you will need them next week.

1. Identify the response and explanatory variables.
2. There is one nominal explanatory variable. Which one?
3. Are there outliers in the response variable (AFD)?
4. Are there outliers in the explanatory variables?
5. Is there collinearity?
6. Do you need a transformation? If yes, why and which one?
7. Are there any clear relationships between response and individual explanatory variables?
8. Are there interactions?
9. What do you think will be the final outcome of this analysis?

Exercise 3; Ozone.xls

Open the data set and import the data. The variable ozone is the ozone concentration at 110 sites, and we want to model it as a function of wind, temperature and radiation. Save all your results in a Word document as you will need them later.

1. Identify the response and explanatory variables.
2. Are there nominal explanatory variables?
3. Are there outliers in the response variable?
4. Are there outliers in the explanatory variables?
5. Is there collinearity?
6. Are there any clear relationships between response and individual explanatory variables?
7. Are there interactions?
8. What do you think will be the final outcome of this analysis?

Exercise 4; parasite.xls

Open the data set and import the data. This data set is about parasites in fish (cod) at different locations north of Norway. Parasite information is available as number per fish (Intensity) or presence-absence (Prevalence). One of these variables can be used as the response variable (say Prevalence). We also have information on length, weight, sex, age and developmental stage of the fish, and the year and area in which the fish was caught.

1. If these were your data, what would be your questions?
2. Identify the response and explanatory variables.
3. Are there nominal explanatory variables? Which ones?
4. Are there outliers in the response variable?
5. Are there outliers in the explanatory variables?
6. Is there collinearity? If there is, what are you going to do about it?
7. Are there any clear relationships between response and individual explanatory variables? A dotplot of the explanatory variable conditional on Prevalence is useful! Use different colours and group the observations using the options under settings.
8. What do you think will be the final outcome of this analysis?

Exercise 5; RIKZ.xls

Open the data file and import the data into Brodgar (use the worksheet “Species and Explan. Var.”). This data set is about marine benthic species measured on beaches in Holland. There are 75 species (all the columns in red) and the remaining columns are explanatory variables. All measurements were taken in June 2002. The names of the explanatory variables are self-explanatory, except perhaps for angle (angle of the beach and of individual stations on a beach) and NAP (the height of a site compared to average sea level).

1. If these were your data, what would be your questions?
2. We don’t want to work with multivariate analysis methods yet, so we will use the species richness diversity index.
3. Identify the response and explanatory variables.
4. Are there nominal explanatory variables? Which ones?
5. Are there missing values?
6. Are there outliers in the response variable?
7. Are there outliers in the explanatory variables?
8. Is there collinearity? If there is, what are you going to do about it?
9. Are there any clear relationships between response and individual explanatory variables?
10. What do you think will be the final outcome of this analysis?

Exercise 6; decapodNew.xls

Open the data set and import the data (use the worksheet “All Families”). These data were already discussed during the lecture. They are decapods measured close to Stonehaven and Loch Ewe. Again, we will use species richness as the response variable. The columns in red are the families.

1. If these were your data, what would be your questions?
2. We don’t want to work with multivariate analysis methods yet, so we will use the species richness diversity index.
3. Identify the response and explanatory variables.
4. Are there nominal explanatory variables? Which ones?
5. Are there missing values?
6. Are there outliers in the response variable?
7. Are there outliers in the explanatory variables?
8. Is there collinearity? If there is, what are you going to do about it?
9. Are there any clear relationships between response and individual explanatory variables?
10. What do you think will be the final outcome of this analysis?

Exercise 7; IrishpH.xls

The data used are a subset of the data analysed in Cruikshanks et al. (2007), a technical report by the Environmental Protection Agency, Wexford, Ireland. We only use the 2003 data, and several recordings were dropped, so our results may differ from those presented in the original report. The original research sampled 257 rivers in Ireland during 2002 and 2003. One of the aims was to find a different tool for identifying acid-sensitive waters, which currently uses measures of pH. The problem with pH is that it is extremely variable within a catchment, and depends on both flow conditions and underlying geology. As an alternative measure, the Sodium Dominance Index (SDI) is proposed as an indicator of the acid sensitivity of rivers. SDI is defined as the contribution of sodium (Na+) to the sum of the major cations. The motivation for this research is the increase in plantation forestry cover in Irish landscapes, and its potential impacts on aquatic resources. Of the 257 sites, 192 were non-forested and 65 were forested. For the moment, ignore the columns labelled “Easting” and “Northing”; these are relevant for a mixed modelling course.

1. If these were your data, what would be your questions?
2. Identify the response and explanatory variables.
3. Are there nominal explanatory variables? Which ones?
4. Are there missing values?
5. Are there outliers in the response variable?
6. Are there outliers in the explanatory variables?
7. Is there collinearity? If there is, what are you going to do about it?
8. Are there any clear relationships between response and individual explanatory variables? Are there potential interactions?
9. What do you think will be the final outcome of this analysis?

Bivariate linear regression

Exercise 8A; Loyn.xls

The aim of this exercise is to get familiar with bivariate linear regression.
You would not normally start with bivariate linear regression if there are multiple explanatory variables. Apply bivariate linear regression to model bird abundance as a function of AREA.

1. What is the fitted model?
2. Are the parameters significant? Use two ways to assess this.
3. How much variation do you explain?
4. Apply a model validation; check all assumptions. Are there patterns in the residuals? Do you have normality and homogeneity?
5. How many birds do you expect if AREA is 100?

Exercise 8B; Loyn.xls

Apply bivariate linear regression to model bird abundance as a function of GRAZE. This model is also called a one-way ANOVA.

1. Using the numerical output, how much variation do you explain?
2. In the output, where is graze level 1?
3. Explain the ANOVA table.
4. Predict the number of birds in graze type 1.
5. Predict the number of birds in graze type 3.
6. Predict the number of birds in graze type 5.
7. Apply a model validation.

Exercise 9; wedgeclamII.xls

The aim of this exercise is to get familiar with bivariate linear regression. You would not normally start with bivariate linear regression if there are multiple explanatory variables. Apply bivariate linear regression to model AFD as a function of LENGTH. For the moment, ignore the month effect. Interpret all the output.

1. What is the fitted model?
2. Are the parameters significant? Use two ways to assess this.
3. How much variation do you explain?
4. Apply a model validation; check all assumptions. Are there patterns in the residuals? Do you have normality and homogeneity?
5. What is the predicted AFD (on the real scale) if length is 10?

Exercise 10. Ozone.xls continued.

The aim of this exercise is to get familiar with bivariate linear regression. You would not normally start with bivariate linear regression if there are multiple explanatory variables. Which is the best single explanatory variable that explains ozone?

Exercise 11. RIKZ data continued.
The aim of this exercise is to get familiar with bivariate linear regression. You would not normally start with bivariate linear regression if there are multiple explanatory variables. Is there a linear relationship between species richness and NAP?

Multiple linear regression

Exercise 12. Loyn data

Find the optimal multiple linear regression model. Apply a model validation and model interpretation. What would you present in a paper or thesis?

Exercise 13. Wedgeclam data

Find the optimal multiple linear regression model. Apply a model validation and model interpretation. What would you present in a paper or thesis?

Exercise 14. Ozone data

Find the optimal multiple linear regression model. Apply a model validation and model interpretation. What would you present in a paper or thesis?

Exercise 15. Irish pH data

Find the optimal multiple linear regression model. Apply a model validation and model interpretation. What would you present in a paper or thesis?

Poisson (or NB) GLM or GAM

Exercise 16. Amphibian road kills data.

The data in the txt file Roadkills.txt come from a two-year study on vertebrate roadkill on a National Road of southern Portugal. The surveyed road has paved verges, two lanes, and a moderate amount of traffic (less than 10,000 vehicles per day). The road surroundings are dominated by cork Quercus suber and holm oak Q. rotundifolia tree stands, named “montado”, and open land including pastures, meadows and fallows. The road was inspected for amphibian roadkills every two weeks between March 1995 and March 1997. Surveys were made by car, travelling slowly (10–20 kilometres per hour) along the road on the hard shoulder. Each animal found dead was identified to species level, whenever possible, and its geographic location, in UTM coordinates, was determined with the help of detailed cartography (1:2000) of horizontal and vertical road profiles and aerial photographs. All carcasses were removed from the road to avoid double counting.
The underlying ecological question in this chapter is simple: is there a relationship between amphibian roadkills (TOT_N) and any of the explanatory variables?

Table 16.1. List of explanatory variables and the abbreviation used in this chapter.

Variable                             Abbreviation
Open lands (ha)                      OPEN_L
Olive groves (ha)                    OLIVE
Montado with shrubs (ha)             MONT_S
Montado without shrubs (ha)          MONT
Polyculture (ha)                     POLIC
Shrubs (ha)                          SHRUB
Urban (ha)                           URBAN
Water reservoirs (ha)                WAT_RES
Length of water courses (km)         L_WAT_C
Dirty road length (m)                L_D_ROAD
Paved road length (km)               L_P_ROAD
Distance to water reservoirs         D_WAT_RES
Distance to water courses            D_WAT_COUR
Distance to Natural Park (m)         D_PARK
Number of habitat patches            N_PATCH
Edges perimeter                      P_EDGE
Landscape Shannon diversity index    L_SDI

1. First investigate whether there are any outliers in the explanatory variables. We decided to square root transform the following covariates: POLIC, WAT_RES, URBAN, OLIVE, L_P_ROAD, SHRUB, D_WAT_COUR. Do you agree with this?
2. Is there any collinearity? Use VIFs.
3. The variable D_PARK represents the spatial position. Plot each covariate against D_PARK using a multi-panel graph. What do you notice?
4. Model the roadkills as a function of the selected covariates. Apply a model validation, and sketch what the model is doing.

Exercise 17. Hake GLM.

Open the file hakeGLM.xls in the data.zip file. The presence of parasites can be used to identify fish stocks. The data consist of the abundance of Grillotia sp. (the parasite) on hake measured at two different locations in the South Atlantic. The variables length, weight and sex of each fish are also available.

• Identify the response and explanatory variables. What is your underlying question?
• Import the data. Tell the package which variable is the Y (response) and which are the X (explanatory variables).
• There are 2 nominal explanatory variables. Which ones?
• Would you consider applying a transformation on the response variable?
• Are the nominal explanatory variables (approximately) equally balanced?
• Which variable has most missing values?
• Is there any collinearity?
• What is the name of the statistical method that you should apply? Apply it. Make sure you fully understand all numerical output.
• Which is your final model?

Exercise 18. Sea lice larvae distribution around Scottish fish farms

The data used in this example are taken from Penston et al. (2008). Plankton tows were taken approximately weekly at two depths (0 m and 5 m) at five stations for two years. In the paper, numbers of nauplii and copepodids were analysed in two separate univariate analyses in which production week (time expressed in weeks since March 2002, when the local farms stocked their cages with lice-free, juvenile fish), station and depth were the covariates. There are five stations, labelled A, C, E, F and G. Stations C and G are adjacent to salmon farms, stations A and F are landward of these farms, and E is seaward of the farms. Here, we only use copepodids.

There are three potential problems with the analysis of these data: we have longitudinal data at each station, there may be correlation between adjacent stations, and there is a large variation in the sampled water volume. As to the first two problems, we follow the same strategy as the paper, namely show that there is no temporal correlation within each of the residual time series, and that there are no strong Pearson correlations between the 5 residual time series. If this is not the case for your data, you have to use generalised linear mixed modelling methods.

The data are in the file Licedata.txt. Concentrate on the Copepod variable (response variable). We want to model it as a function of station, depth and production week.

1. What is the maximum possible model?
2. Instead of doing a full backward selection, we are nice to you, and tell you what the optimal model is: a smoother for production week at depth 0 m, a smoother for production week at depth 5 m, and the main effects depth and station. There is one thing we did not tell you. What?
3. Fit the model in step 2.

Exercise 19. Parasite species on sand perch

Open the file Turco.xls in the data.zip file. The data consist of the abundance of 21 parasite species on sand perch (Pinguipes brasilianus) measured at three different locations in the South Atlantic. The variables length, weight and sex of each fish are also available. The underlying idea is to know whether there is a relationship between the number of parasites and the length/weight of fish. Import the data. Tell the package which variable is the Y (response) and which are the X (explanatory variables).

• There are two nominal explanatory variables. Which ones?
• Would you consider applying a transformation on the response variable?
• Are the nominal explanatory variables (approximately) equally balanced?
• Can you identify variables with missing values?
• Is there any collinearity?
• Is there any scope for interaction?
• What is the name of the statistical method that you should apply? Apply it. Make sure you fully understand all numerical output.
• Which is your final model?

Exercise 20: Monkfish data

Analyse the variable TotalA in the file MonkFish.xls.

Exercise 21: Dragonfly data; Poisson or ZIP GLM?

The file DragonFlies.xls contains the number of mites on a dragonfly. Samples were taken at three locations. Possible covariates are the wing length of a dragonfly, location and temperature. There may be interactions. Analyse the data.

Binomial GLM or GAM

Exercise 22. Uta species

Open the Excel file polis.xls. Polis et al. (1998) studied the factors that control spider populations on islands in the Gulf of California.
Here, we use part of their data and model the presence/absence of the lizards (Uta) as a function of the ratio of perimeter to area (PA), a measure of input of marine detritus.

1. Import the data. What is the response variable?
2. Which method do you need? Apply it.
3. What is the outcome?

Exercise 23. BOAR. Boar.xls

The presence of tuberculosis-like lesions and serum samples tested for antibodies to Brucella spp. (classical swine fever virus) were analysed in European wild boars (Sus scrofa). Samples were collected in south-central Spain, and information about sex, age class and total length of the animal was also recorded.

1. Identify the response and explanatory variables.
2. There are 2 nominal explanatory variables. Which ones?
3. Import the data.
4. Does it make sense to investigate the response variable for outliers?
5. Are the data equally balanced (you don’t want to have a nominal explanatory variable with only a few observations in one stratum, and lots of observations in another stratum)?
6. Which variable has most missing values?
7. Is there collinearity?
8. What is the name of the statistical method that you should apply? Apply it.
9. Does any explanatory variable explain the presence of tuberculosis in wild boars? If so, what is this relationship? How would you write the model?
10. Sketch the fitted values in a graph.
11. Repeat the analysis, but now use Brucella as the response variable.

Exercise 24. Solea solea

This is a real example and it is not easy. The solution is described in Chapter 22 in Zuur et al. (2007). Open the file Soleasolea.xls. We want to know whether presence/absence of Solea solea can be modelled as a function of the explanatory variables. The only way you will be able to get something out of this analysis is by being highly critical of what you put into the model in terms of explanatory variables. Use common sense to decide which explanatory variables are collinear (both in terms of statistics and biology)!

Exercise 25. Parasite data in ParasiteCod.txt

The red king crab Paralithodes camtschaticus was introduced in the Barents Sea in the 1960s and 1970s from its native area in the North Pacific. The leech Johanssonia arctica uses the carapace of this crab to deposit eggs. The leech is an intermediate host for a trypanosome blood parasite of marine fish, including cod. Hemmingsen et al. (2004) examined a large number of cod for trypanosome infections during annual cruises along the coast of Finnmark in North Norway. These cruises covered three years and were divided into four ‘stations’, or areas. Full details of the research and results can be found in their paper. Their statistical analyses were carried out using Chi-square statistics and analysis of variance, and are in principle all correct. Here, we use a subset of the data and repeat their analyses with GLM. The response variable is Prevalence, which is coded as 1 if the parasite is present, and 0 otherwise. Possible explanatory variables are year, area and the depth at which the fish was caught.

1. Investigate collinearity.
2. Hemmingsen et al. (2004) used a model with the main terms year, area, and length, and an interaction term year × area, and we will also use this set of covariates. Apply a GLM (or GAM) using this set of covariates.

Exercise 26. TB in deer

The file tbdeer.xls contains data on TB in wild boar and red deer. There is information about a common disease in both species (tuberculosis), and about a parasite which only infects red deer. There is information about the main characteristics of the habitat and management (fencing). Is TB in deer influenced by any of the covariates?

Exercise 27. Trifur

The file Trifur.xls comprises the presence/absence of the parasitic copepod Trifur tortuosus, a parasite infecting the body surface and gills of the Brazilian sand perch Pinguipes brasilianus.
Fish samples were collected in 3 main areas of the Argentinean sea and the file also contains information on sex, length and weight of the fish.

1. Identify the response and explanatory variables.
2. There are 2 nominal explanatory variables. Which ones?
3. Import the data.
4. Are the data equally balanced (you don’t want to have a nominal explanatory variable with only a few observations in one stratum, and lots of observations in another stratum)?
5. Are there missing values?
6. Is there collinearity?
7. Are there any possible interaction terms?
8. What is the name of the statistical method that you should apply? Apply it.
9. Does any explanatory variable explain the presence of parasites in fish? If so, what is this relationship? How would you write the model?
10. Sketch the fitted values in a graph.

Extra exercise

Exercise 28. Bailey et al. (2008) data.

The file Baileyetal2008.xls contains fish abundance data. Since 1979, fisheries surveys have been taken in an area of the NE Atlantic Ocean (ca. 50°N 13°W), assembling a unique fishery-independent dataset from trawls conducted on commercial fishing grounds (800-1500 m), and beyond on the slope (1500-4000 m) and abyssal plain (to 4800 m). Sampling took place in two sampling periods. The “Early” period (1979 to 1989) is before and during the development of the fishery, while the “Late” period (1997 to 2002; 64 trawls) is considered post commercial fishing. Gear and techniques used were identical throughout. See Bailey et al. (2008) for details of sampling.

The variables are:

Site           Label
TotAbund       Total abundance of all fish
Dens           Density of all fish (= total abundance / swept area)
MeanDepth      Mean depth of a trawl
Year           The year that a trawl was taken
Period         Time period
Xkm            Spatial position
Ykm            Spatial position
SweptArea      The swept area during a trawl (= sampling effort)
LogSweptArea   The natural log of the swept area

The response variables are Dens and TotAbund.
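The definitions of Dens and LogSweptArea in the variable list can be illustrated with a small sketch. It is written in Python purely for illustration (the exercises themselves assume R or Brodgar), and the trawl values and function names are hypothetical, not taken from Baileyetal2008.xls. It only shows how the three columns relate to each other.

```python
import math

def density(tot_abund, swept_area):
    """Dens = total abundance / swept area, as defined in the variable list."""
    if swept_area <= 0:
        raise ValueError("swept area must be positive")
    return tot_abund / swept_area

def log_swept_area(swept_area):
    """LogSweptArea = natural log of the swept area."""
    return math.log(swept_area)

# Hypothetical trawl (made-up numbers, units unspecified).
tot_abund = 150
swept_area = 50000.0

dens = density(tot_abund, swept_area)
log_area = log_swept_area(swept_area)

# On the log scale, abundance splits into log-density plus log-effort:
# log(TotAbund) = log(Dens) + LogSweptArea
lhs = math.log(tot_abund)
rhs = math.log(dens) + log_area
print(dens, log_area, lhs, rhs)
```

The last identity is worth keeping in mind when thinking about how sampling effort might enter a model for TotAbund.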
The underlying question is whether the density-depth relationship has changed over time. The same question can be asked for total abundance.

1. Identify the response and explanatory variables.
2. There is 1 nominal explanatory variable. Which one? Why did we define Period? Can we use Period and Year?
3. Import the data.
4. Are the data equally balanced (you don’t want to have a nominal explanatory variable with only a few observations in one stratum, and lots of observations in another stratum)?
5. Plot the spatial positions of the sites. What should we do? Are the data spatially balanced? How can you criticise any outcome?
6. Are there missing values?
7. Is there collinearity? Are covariates spatially collinear?
11. Are there any possible interaction terms?
12. What about sampling effort?
13. Regression/GAM part: What is the name of the statistical method that you should apply? Apply it. What is the optimal model? Apply a model validation.
14. GLM part: Repeat the analysis for the abundance data.

Reference: Bailey DM, Collins MA, Gordon JDM, Zuur AF, Priede IG (2008) Long-term changes in deep-water fish populations in the North East Atlantic: a deeper-reaching effect of fisheries? Proceedings of the Royal Society B.

Exercise 29. Health care data

The file hersdata.xls contains clinical trial data from Hulley et al. (1998). There are 2763 rows (people) in this data file. This research is about diabetes, glucose and cholesterol (LDL and HDL). Covariates are age, race, smoking, drinking, exercise, weight, BMI, statins (statins are a class of drugs to lower cholesterol), WHR (waist/hip ratio), etc. Have a look at the data file, and also at the file hersedata.codebook.doc. We have deleted some of the original covariates to simplify the data set.

1. Import the data file.
2. Inspect all the response variables (glucose, LDL and HDL) for outliers.
3. A person is called diabetic if his/her glucose is larger than 125 mg/dL, while levels between 100 and 125 indicate potential trouble. Should we use the variable diabetics as a covariate?
4. Inspect all the covariates for outliers.
5. Investigate the covariates for collinearity.
6. For each response variable, investigate possible relationships and interactions.

Regression exercise

Apply linear regression on LDL and model it as a function of the covariates. One of the questions is whether there is a BMI × statins interaction.

GAM exercise

Model HDL as a function of all the covariates. Is the BMI effect linear?
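To make the idea of a BMI × statins interaction in the regression exercise concrete, here is a minimal, dependency-free sketch in Python (the exercises assume R, where a model of this shape is normally written with a formula such as `y ~ x1 * x2`). Everything in it is an assumption for illustration: the data are synthetic and noise-free, the coefficients are invented, and the function names are hypothetical. It is not an analysis of hersdata.xls.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_interaction_model(bmi, statins, y):
    """Least squares for y = b0 + b1*BMI + b2*statins + b3*BMI*statins."""
    rows = [[1.0, x, float(s), x * s] for x, s in zip(bmi, statins)]
    p = 4
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Synthetic, noise-free data with invented coefficients (illustration only).
rng = random.Random(1)
bmi = [18.0 + 22.0 * rng.random() for _ in range(200)]
statins = [i % 2 for i in range(200)]          # 0 = no statins, 1 = statins
true_coef = [120.0, 1.5, -30.0, 0.8]
ldl = [true_coef[0] + true_coef[1] * x + true_coef[2] * s + true_coef[3] * x * s
       for x, s in zip(bmi, statins)]

coef = fit_interaction_model(bmi, statins, ldl)
slope_without = coef[1]              # BMI slope when statins = 0
slope_with = coef[1] + coef[3]       # BMI slope when statins = 1
print([round(c, 4) for c in coef], slope_without, slope_with)
```

The interaction coefficient (the fourth entry) is the difference between the BMI slopes of the two statin groups; asking whether there is an interaction amounts to asking whether that difference is distinguishable from zero, which in practice requires standard errors from a proper regression routine rather than this bare fit.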
