Data may come from a population or from a sample. Most data can be put into the following categories:
- Qualitative
- Quantitative
Qualitative data are the result of categorizing or describing attributes of a population. Qualitative data are also often called categorical data. Hair color, blood type, ethnic group, the car a person drives, and the street a person lives on are examples of qualitative data. Qualitative data are generally described by words or letters. For instance, hair color might be black, dark brown, light brown, blonde, gray, or red. Blood type might be AB+, O-, or B+. Researchers often prefer to use quantitative data over qualitative data because they lend themselves more easily to mathematical analysis. For example, it does not make sense to find an average hair color or blood type.
Quantitative data are always numbers. Quantitative data are the result of counting or measuring attributes of a population. A person's pulse rate, weight, number of athletes on a sports team, and number of students who take statistics are examples of quantitative data. Quantitative data may be either discrete or continuous.
All data that are the result of counting are called quantitative discrete data. These data take on only certain numerical values. If you count the number of steps you accumulate for each day of the week, you might get values such as 5,000, 10,000 or 20,000.
Data that are not only made up of counting numbers, but that may include fractions, decimals, or irrational numbers, are called quantitative continuous data. Continuous data are often the results of measurements like lengths, weights, or times. A list of the body weights for participants in a weight loss research study, with numbers like 110.1 or 150.5, would be quantitative continuous data.
Data Sample of Quantitative Discrete Data
The data are the number of injury reports submitted by players on a university's soccer team during a single season. You sample six players from the team. Specifically, two players submitted zero injury reports, another two players submitted one injury report, one player submitted three injury reports, and the final player submitted five injury reports. The numbers of injury reports (zero, one, three, and five) are the quantitative discrete data.
Data Sample of Quantitative Continuous Data
The data are the body weights of players on a university's soccer team, measured during pre-season training. You sample six players from the team. The measured weights (in kilograms) of the sampled players are: 68.2 kg, 75.0 kg, 69.8 kg, 81.3 kg, 70.5 kg, and 75.0 kg. The weights are quantitative continuous data.
A researcher conducts a fitness assessment on a group of young adults and records the following observations for each participant:
- Three time-based performance measures (in seconds), such as time to complete a 40-yard sprint, the duration of a seated wall squat, and the reaction time to a visual cue.
- Two demographic variables, such as the participant's preferred mode of daily transportation (e.g., walking, cycling, driving) and their stated primary sport (e.g., soccer, basketball, running).
- Three count-based measures, such as the number of push-ups completed in one minute, the number of self-reported concussions in the past year, and the number of resistance training sessions per week.
Problem
Name data sets that are quantitative discrete, quantitative continuous, and qualitative.
One Possible Solution:
Quantitative Discrete Data includes the number of self-reported concussions in the past year, as this results from counting distinct, whole units.
Quantitative Continuous Data includes the time to complete a 40-yard sprint, as this is a measurement that can take on any value within a range.
Qualitative (Categorical) Data includes the participant's preferred mode of daily transportation, as this places individuals into distinct, non-numerical groups.
Try to identify additional data sets in this example.
Determine the correct data type (quantitative or qualitative). Indicate whether quantitative data are continuous or discrete. Hint: Data that are discrete often start with the words "the number of."
a. the number of pairs of shoes you own
b. the type of car you drive
c. the distance from your home to the nearest grocery store
d. the number of classes you take per school year
e. the type of calculator you use
f. weights of dogs at an animal shelter
g. number of correct answers on a quiz
h. IQ scores (This may cause some discussion.)
Solution
Items a, d, and g are quantitative discrete; items c, f, and h are quantitative continuous; items b and e are qualitative, or categorical.
A statistics professor collects information about the classification of her students as first-year students, sophomores, juniors, or seniors. The data she collects are summarized in the pie chart in Figure \(\PageIndex{1}\). What type of data does this graph show?
Solution
This pie chart shows the students in each year, which is qualitative (or categorical) data.
Qualitative Data Discussion
Below are tables comparing the number of part-time and full-time students at De Anza College and Foothill College enrolled for the most recent spring quarter. The tables display counts (frequencies) and percentages or proportions (relative frequencies). The percent columns make comparing the same categories in the colleges easier. Displaying percentages along with the numbers is often helpful, but it is particularly important when comparing sets of data that do not have the same totals, such as the total enrollments for both colleges in this example. Notice how much larger the percentage for part-time students at Foothill College is compared to De Anza College.
| De Anza College | | | Foothill College | | |
|---|---|---|---|---|---|
| | Number | Percent | | Number | Percent |
| Full-time | 9,200 | 40.9% | Full-time | 4,059 | 28.6% |
| Part-time | 13,296 | 59.1% | Part-time | 10,124 | 71.4% |
| Total | 22,496 | 100% | Total | 14,183 | 100% |
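The percent columns can be reproduced from the counts. The following is a minimal sketch in Python; the function name `relative_frequencies` and the dictionary layout are illustrative choices, not part of the original table.

```python
# Enrollment counts copied from the table above.
enrollment = {
    "De Anza College": {"Full-time": 9200, "Part-time": 13296},
    "Foothill College": {"Full-time": 4059, "Part-time": 10124},
}


def relative_frequencies(counts):
    """Return each category's share of the total as a percent, rounded to one decimal."""
    total = sum(counts.values())
    return {category: round(100 * n / total, 1) for category, n in counts.items()}


percents = {college: relative_frequencies(counts) for college, counts in enrollment.items()}
for college, shares in percents.items():
    print(college, shares)
```

Because each percent is computed against its own college's total, the two colleges can be compared directly even though their enrollments differ.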
Tables are a good way of organizing and displaying data. But graphs can be even more helpful in understanding the data. There are no strict rules concerning which graphs to use. Two graphs that are used to display qualitative data are pie charts and bar graphs.
In a pie chart, categories of data are represented by wedges in a circle and are proportional in size to the percent of individuals in each category.
In a bar graph, the length of the bar for each category is proportional to the number or percent of individuals in each category. Bars may be vertical or horizontal.
A Pareto chart consists of bars that are sorted into order by category size (largest to smallest).
Look at Figures \(\PageIndex{3}\) and \(\PageIndex{4}\) and determine which graph (pie or bar) you think displays the comparisons better.
It is a good idea to look at a variety of graphs to see which is the most helpful in displaying the data. We might make different choices of what we think is the “best” graph depending on the data and the context. Our choice also depends on what we are using the data for.
Percentages That Add to More (or Less) Than 100%
Sometimes percentages add up to be more than 100% (or less than 100%). In the graph, the percentages add to more than 100% because students can be in more than one category. A bar graph is appropriate to compare the relative size of the categories. A pie chart cannot be used. It also could not be used if the percentages added to less than 100%.
| Characteristic/Category | Percent |
|---|---|
| Full-Time Students | 40.9% |
| Students who intend to transfer to a 4-year educational institution | 48.6% |
| Students under age 25 | 61.0% |
| TOTAL | 150.5% |
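The rule described here (a pie chart only makes sense when the categories account for the whole) can be written as a quick check. This is a sketch only; the function name and the rounding tolerance of 0.5 percentage points are assumptions made for illustration.

```python
def pie_chart_is_appropriate(percent_by_category, tolerance=0.5):
    """Return True when the percentages add to roughly 100%.

    The tolerance (an assumed value) allows for rounding in published tables.
    """
    total = sum(percent_by_category.values())
    return abs(total - 100) <= tolerance


# Overlapping characteristics from the table above: a student can be in several.
overlapping = {
    "Full-Time Students": 40.9,
    "Students who intend to transfer to a 4-year educational institution": 48.6,
    "Students under age 25": 61.0,
}

# Mutually exclusive categories that cover everyone (De Anza enrollment status).
exclusive = {"Full-time": 40.9, "Part-time": 59.1}

print(pie_chart_is_appropriate(overlapping))
print(pie_chart_is_appropriate(exclusive))
```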
Omitting Categories/Missing Data
The table displays Ethnicity of Students but is missing the "Other/Unknown" category. This category contains people who did not feel they fit into any of the ethnicity categories or declined to respond. Notice that the frequencies do not add up to the total number of students. In this situation, create a bar graph and not a pie chart.
| | Frequency | Percent |
|---|---|---|
| Asian | 8,794 | 36.1% |
| Black | 1,412 | 5.8% |
| Filipino | 1,298 | 5.3% |
| Hispanic/Latino | 4,180 | 17.1% |
| Native American | 146 | 0.6% |
| Pacific Islander | 236 | 1.0% |
| White | 5,978 | 24.5% |
| TOTAL | 22,044 out of 24,382 | 90.4% out of 100% |
The following graph is the same as the previous graph but the “Other/Unknown” percent (9.6%) has been included. The “Other/Unknown” category is large compared to some of the other categories (Native American, 0.6%, Pacific Islander 1.0%). This is important to know when we think about what the data are telling us.
This particular bar graph in Figure 1.9 can be difficult to understand visually. The graph in Figure 1.10 is a Pareto chart. The Pareto chart has the bars sorted from largest to smallest and is easier to read and interpret.
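The size of the missing category and the Pareto ordering can both be derived from the frequency table. The sketch below assumes the published total of 24,382 students and simply sorts the categories from largest to smallest; the variable names are illustrative.

```python
# Frequencies copied from the ethnicity table above.
frequencies = {
    "Asian": 8794,
    "Black": 1412,
    "Filipino": 1298,
    "Hispanic/Latino": 4180,
    "Native American": 146,
    "Pacific Islander": 236,
    "White": 5978,
}
total_students = 24382

# Students not counted in any listed category.
frequencies["Other/Unknown"] = total_students - sum(frequencies.values())

# Pareto order: largest category first.
pareto = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
for category, count in pareto:
    print(f"{category:<18}{count:>7,}{100 * count / total_students:>7.1f}%")
```

Sorting before plotting is what turns an ordinary bar graph into a Pareto chart; the data themselves do not change.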
Pie Charts: No Missing Data
The following pie charts have the “Other/Unknown” category included (since the percentages must add to 100%). The chart in Figure \(\PageIndex{9}\)(b) is organized by the size of each wedge, which makes it a more visually informative graph than the unsorted, alphabetical graph in Figure \(\PageIndex{9}\)(a).
Sampling
Gathering information about an entire population often costs too much or is virtually impossible. Instead, we use a sample of the population. A sample should have the same characteristics as the population it is representing. Most statisticians use various methods of random sampling in an attempt to achieve this goal, and this section describes a few of the most common ones. In each form of random sampling, each member of a population initially has an equal chance of being selected for the sample. Each method has pros and cons. The easiest method to describe is called a simple random sample. Any group of n individuals is equally likely to be chosen as any other group of n individuals if the simple random sampling technique is used. In other words, each sample of the same size has an equal chance of being selected.
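A simple random sample can be sketched with Python's standard `random` module. The roster of 100 numbered students below is hypothetical, and the `seed` argument is only there to make the illustration repeatable.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Choose n distinct members so that every group of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)


# Hypothetical roster: students numbered 1 through 100.
student_ids = list(range(1, 101))
sample = simple_random_sample(student_ids, 10, seed=1)
print(sorted(sample))
```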
Besides simple random sampling, there are other forms of sampling that involve a chance process for getting the sample. Other well-known random sampling methods are the stratified sample, the cluster sample, and the systematic sample.
To choose a stratified sample, divide the population into groups called strata and then take a proportionate number from each stratum. For example, you could stratify (group) your county's adult population by age category (e.g., 18–30, 31–50, 51–70, 71+) and then choose a proportionate simple random sample from each stratum (each age group) to get a stratified random sample. This ensures your final sample accurately reflects the age distribution of the county. To choose a simple random sample from each age group, you would number each member of the first group, number each member of the second group, and do the same for the remaining groups. Then use simple random sampling to choose proportionate numbers from the first group and do the same for each of the remaining groups. Those numbers picked from the first age group, picked from the second age group, and so on represent the members who make up the stratified sample.
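The stratified procedure might look like the following sketch. The stratum sizes (300, 400, 200, and 100 adults) are made up for illustration, and rounding each stratum's share is an assumption that can make the final sample size differ slightly from the target when the proportions do not divide evenly.

```python
import random


def stratified_sample(strata, total_n, seed=None):
    """Draw a proportionate simple random sample from each stratum.

    strata maps a stratum label to the list of its members.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for label, members in strata.items():
        # Each stratum contributes in proportion to its share of the population.
        k = round(total_n * len(members) / population_size)
        sample[label] = rng.sample(members, k)
    return sample


# Hypothetical county of 1,000 adults, grouped by age category.
strata = {
    "18-30": [f"18-30 #{i}" for i in range(300)],
    "31-50": [f"31-50 #{i}" for i in range(400)],
    "51-70": [f"51-70 #{i}" for i in range(200)],
    "71+": [f"71+ #{i}" for i in range(100)],
}
sample = stratified_sample(strata, total_n=50, seed=2)
print({label: len(chosen) for label, chosen in sample.items()})
```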
To choose a cluster sample, divide the population into clusters (groups) and then randomly select some of the clusters. All the members from these selected clusters are in the cluster sample. For example, if you randomly sample four elementary schools from your city's entire public school system to study childhood nutrition habits, the four selected schools make up the cluster sample. Divide your city's school population by school building. The individual school buildings are the clusters. Number each school, and then choose four different numbers using simple random sampling. All students enrolled in the four schools with those numbers are the cluster sample.
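A cluster sample differs from a stratified sample in that whole groups are selected and every member of a selected group is kept. A minimal sketch, assuming a hypothetical city with 12 schools of 30 students each:

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick n_clusters clusters and keep every member of each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    members = [member for name in chosen for member in clusters[name]]
    return chosen, members


# Hypothetical school system: 12 schools, 30 students per school.
schools = {
    f"School {i}": [f"School {i} student {j}" for j in range(1, 31)]
    for i in range(1, 13)
}
chosen, members = cluster_sample(schools, 4, seed=3)
print(chosen, len(members))
```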
To choose a systematic sample, randomly select a starting point and take every nth piece of data from a listing of the population. For example, suppose you have to do a physical activity surveillance study for a large corporate campus. Your employee directory contains 5,000 listings. You must choose 100 names for the sample. You would number the population 1–5,000 and then use a simple random sample to pick a number that represents the first employee in the sample. Since 5,000 / 100 = 50, you would then choose every fiftieth name thereafter until you have a total of 100 names (you might have to circle back to the beginning of your list). Systematic sampling is frequently chosen because it is a simple method, especially when dealing with long lists.
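The directory example can be sketched as follows. The interval is computed as the population size divided by the sample size (5,000 / 100 = 50), and the modulo operation handles the "circle back to the beginning" step; the function name and seed are illustrative.

```python
import random


def systematic_sample(population, n, seed=None):
    """Pick a random starting point, then take every k-th item, wrapping around."""
    rng = random.Random(seed)
    k = len(population) // n                 # sampling interval
    start = rng.randrange(len(population))   # random starting index
    return [population[(start + i * k) % len(population)] for i in range(n)]


# Employee directory numbered 1 through 5,000.
directory = list(range(1, 5001))
sample = systematic_sample(directory, 100, seed=4)
print(sample[:5])
```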
A type of sampling that is non-random is convenience sampling. Convenience sampling involves using results that are readily available. For example, a kinesiologist conducting preliminary research on a new exercise device conducts a study by interviewing clients who happen to be visiting their physical therapy clinic that day. The results of convenience sampling may be very good for pilot studies in some cases and highly biased (favor certain outcomes) in others, as the sample (clinic patients) may not represent the general population.
Sampling data should be done very carefully. Collecting data carelessly can have devastating results. Surveys distributed via a mass email to a university's entire student body, and then returned by students who choose to participate, may be very biased (they may favor a certain group, such as those with strong opinions on campus health policies). It is better for the person conducting the survey to select the sample respondents using a probability-based method.
In statistics, a sampling bias is created when a sample is collected from a population and some members of the population are not as likely to be chosen as others (remember, each member of the population should have an equally likely chance of being chosen). When a sampling bias happens, there can be incorrect conclusions drawn about the population that is being studied.
Critical Evaluation
We need to evaluate the statistical studies we read about critically and analyze them before accepting the results of the studies. Common problems to be aware of include
- Problems with samples: A sample must be representative of the population. A sample that is not representative of the population is biased, and biased samples give results that are inaccurate and not valid.
- Self-selected samples: Responses only by people who choose to respond, such as call-in surveys, are often unreliable.
- Sample size issues: Samples that are too small may be unreliable. Larger samples are better, if possible. In some situations, small samples are unavoidable and can still be used to draw conclusions. Examples: crash testing cars or medical testing for rare conditions.
- Undue influence: collecting data or asking questions in a way that influences the response
- Non-response or refusal of subject to participate: The collected responses may no longer be representative of the population. Often, people with strong positive or negative opinions may answer surveys, which can affect the results.
- Causality: A relationship between two variables does not mean that one causes the other to occur. They may be related (correlated) because of their relationship through a different variable.
- Self-funded or self-interest studies: A study performed by a person or organization in order to support their claim. Is the study impartial? Read the study carefully to evaluate the work. Do not automatically assume that the study is good, but do not automatically assume the study is bad either. Evaluate it on its merits and the work done.
- Misleading use of data: improperly displayed graphs, incomplete data, or lack of context
- Confounding: When the effects of multiple factors on a response cannot be separated. Confounding makes it difficult or impossible to draw valid conclusions about the effect of each factor.
A study is done to determine the average number of minutes of moderate-to-vigorous physical activity (MVPA) that all university undergraduate students engage in per week. Each student in the following samples is asked to report their average weekly MVPA. What is the type of sampling in each case?
a. A sample of 100 undergraduate students is taken by organizing the students’ names by class standing (first-year student, sophomore, junior, or senior), and then selecting 25 students from each classification.
b. A random number generator is used to select a student from the alphabetical listing of all undergraduate students. Starting with that student, every 40th student is chosen until 50 students are included in the sample.
c. A completely random method is used to select 50 students. Each undergraduate student has the same probability of being chosen at any stage of the sampling process.
d. The major academic colleges (Engineering, Arts & Sciences, Education, Business) are numbered one, two, three, and four, respectively. A random number generator is used to pick two of those colleges. All students enrolled in those two colleges are in the sample.
e. A research assistant is asked to stand in front of the campus recreation center one Friday morning and to ask the first 100 undergraduate students he encounters what their average weekly MVPA is. Those 100 students are the sample.
Solution
a. stratified; b. systematic; c. simple random; d. cluster; e. convenience
You are going to use the random number generator to generate different types of samples from the data.
This table displays six sets of Rating of Perceived Exertion (RPE) exercise scores (maximum exertion = 10) for six adults participating in an exercise study during a 20-minute treadmill test.
| #1 | #2 | #3 | #4 | #5 | #6 |
|---|---|---|---|---|---|
| 5 | 7 | 10 | 9 | 8 | 3 |
| 10 | 5 | 9 | 8 | 7 | 6 |
| 9 | 10 | 8 | 6 | 7 | 9 |
| 9 | 10 | 10 | 9 | 8 | 9 |
| 7 | 8 | 9 | 5 | 7 | 4 |
| 9 | 9 | 9 | 10 | 8 | 7 |
| 7 | 7 | 10 | 9 | 8 | 8 |
| 8 | 8 | 9 | 10 | 8 | 8 |
| 9 | 7 | 8 | 7 | 7 | 8 |
| 8 | 8 | 10 | 9 | 8 | 7 |
Instructions: Use the Random Number Generator to pick samples.
- Create a stratified sample by column. Pick three RPE scores randomly from each column.
  - Number each row one through ten.
  - On your calculator, press MATH and arrow over to PRB.
  - For column 1, press 5:randInt( and enter 1,10). Press ENTER. Record the number. Press ENTER two more times and record these numbers, even if they repeat. Record the three RPE scores in column one that correspond to these three numbers.
  - Repeat for columns two through six.
  - These 18 RPE scores are a stratified sample.
- Create a cluster sample by picking two of the columns. Use the column numbers: one through six.
  - Press MATH and arrow over to PRB.
  - Press 5:randInt( and enter 1,6). Press ENTER. Record the number. Press ENTER and record that number. If the second number repeats the first, press ENTER again until you get a different column.
  - The two numbers are for two of the columns.
  - The RPE scores (20 of them) in these two columns are the cluster sample.
- Create a simple random sample of 15 RPE scores.
  - Use the numbering one through 60.
  - Press MATH. Arrow over to PRB. Press 5:randInt( and enter 1,60).
  - Press ENTER 15 times and record the numbers.
  - Record the RPE scores that correspond to these numbers.
  - These 15 RPE scores are the random sample.
- Create a systematic sample of 12 RPE scores.
  - Use the numbering one through 60.
  - Press MATH. Arrow over to PRB. Press 5:randInt( and enter 1,60).
  - Press ENTER. Record the number and the first RPE score. From that number, count five RPE scores (60 / 12 = 5) and record that RPE score. Keep counting five RPE scores and recording the RPE score until you have a sample of 12 RPE scores. You may wrap around (go back to the beginning).
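The same four samples can be drawn in software instead of on a calculator. The sketch below uses `random.randint(a, b)`, which, like `randInt(`, returns an integer from a through b inclusive. Several choices here are assumptions made for illustration: the scores are numbered 1 through 60 row by row, the two cluster columns are forced to be different, and the systematic interval is taken as 60 / 12 = 5 so that the 12 selected positions are distinct.

```python
import random

# RPE scores from the table above: 10 rows, 6 columns.
rpe = [
    [5, 7, 10, 9, 8, 3],
    [10, 5, 9, 8, 7, 6],
    [9, 10, 8, 6, 7, 9],
    [9, 10, 10, 9, 8, 9],
    [7, 8, 9, 5, 7, 4],
    [9, 9, 9, 10, 8, 7],
    [7, 7, 10, 9, 8, 8],
    [8, 8, 9, 10, 8, 8],
    [9, 7, 8, 7, 7, 8],
    [8, 8, 10, 9, 8, 7],
]
columns = [list(col) for col in zip(*rpe)]
flat = [score for row in rpe for score in row]  # assumption: numbered 1-60 row by row

rng = random.Random(5)  # fixed seed so the illustration is repeatable

# Stratified: three random row numbers (repeats allowed) in each column.
stratified = [col[rng.randint(1, 10) - 1] for col in columns for _ in range(3)]

# Cluster: two different columns, keeping all ten scores in each.
cluster_columns = rng.sample(range(1, 7), 2)
cluster = [score for c in cluster_columns for score in columns[c - 1]]

# Simple random: 15 random positions out of 60 (repeats allowed, as on the calculator).
simple_random = [flat[rng.randint(1, 60) - 1] for _ in range(15)]

# Systematic: random start, then every fifth score, wrapping around the list.
interval = len(flat) // 12
start = rng.randint(1, 60)
systematic = [flat[(start - 1 + i * interval) % len(flat)] for i in range(12)]

print(len(stratified), len(cluster), len(simple_random), len(systematic))
```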
Determine the type of sampling used (simple random, stratified, systematic, cluster, or convenience).
a. A soccer coach selects six players from a group of boys aged eight to ten, seven players from a group of boys aged 11 to 12, and three players from a group of boys aged 13 to 14 to form a recreational soccer team.
b. A researcher interviews all school teachers in five different school districts.
c. A high school educational researcher interviews 50 public high school teachers and 50 private high school teachers.
d. A medical researcher interviews every third cancer patient from a list of cancer patients at a local hospital.
e. A high school counselor uses a computer to generate 50 random numbers and then picks students whose names correspond to the numbers.
f. A research assistant interviews exercise class participants in their study to determine how many classes they participate in each week, on average.
Solution
a. stratified; b. cluster; c. stratified; d. systematic; e. simple random; f. convenience
In Confidence Intervals of the text, sample size formulas are provided which will determine sample sizes when sampling from a population. The sample size will be a function of the desired precision and not a function of the population size. It may be somewhat counterintuitive that the sample size does not depend on the population size. However, this implies that a sample of 1,000 can represent a population of 100,000 as adequately as it represents a population of 1,000,000, provided that the same level of precision is desired. When working in Confidence Intervals with sample size formulas, the student will notice that population size is not a factor in determining the sample size.
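As a preview, one widely used sample size formula for estimating a proportion is n = z²·p(1 − p)/E², where E is the desired margin of error, z is the critical value for the confidence level, and p is a planning value for the proportion. This particular formula and the defaults below (z = 1.96 for 95% confidence, p = 0.5 as the most conservative choice) are assumptions for illustration and are not quoted from the Confidence Intervals chapter. Note that the population size does not appear anywhere in the calculation.

```python
import math


def sample_size_for_proportion(margin_of_error, z=1.96, p=0.5):
    """Smallest n with z * sqrt(p * (1 - p) / n) <= margin_of_error.

    Assumes a large population; z=1.96 corresponds to 95% confidence
    and p=0.5 is the most conservative planning value.
    """
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)


print(sample_size_for_proportion(0.03))  # 3 percentage points
print(sample_size_for_proportion(0.05))  # 5 percentage points
```

Under these assumptions the required sample size is driven entirely by the precision requested, which is why a sample of roughly a thousand can serve populations of very different sizes.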
Suppose State University has 15,000 part-time students (the population). We are interested in the average number of steps a part-time student takes per day during the fall term. Asking all 15,000 students is an almost impossible task.
Suppose we take two different samples:
Sample 1: Convenience Sampling
We use convenience sampling and survey ten students who are attending a required Advanced Anatomy and Physiology lab session. The daily step counts they report are as follows:
12,500; 11,800; 14,100; 10,900; 13,500; 15,200; 11,500; 13,800; 10,500; 12,900
Sample 2: Systematic Sampling (Biased Subset)
The second sample is taken using a list of senior citizens enrolled in the university's community outreach health and wellness program, which includes several low-impact exercise classes. We take every fifth senior citizen on the list, for a total of ten senior citizens. They report daily step counts:
4,500; 3,200; 5,100; 3,900; 4,800; 6,000; 3,500; 4,200; 3,100; 5,800
It is unlikely that any student is in both samples.
Problem
a. Do you think that either of these samples is representative of (or is characteristic of) the entire 15,000 part-time student population?
Solution
a. No. Neither sample is likely to be representative of the entire part-time student population, and both introduce sampling bias.
Sample 1 probably consists of students in demanding science or pre-health majors who may be younger, more health-conscious, and have busier, more active schedules than the average part-time student. Their reported step counts (averaging over 12,500) are likely higher than the population average, introducing an upward bias.
Sample 2 consists of senior citizens who, despite being active members of a wellness program, are likely older and have lower mobility than the typical part-time student population. Their reported step counts (averaging under 4,500) are likely lower than the population average, introducing a downward bias.
In both cases, not all 15,000 part-time students had an equal chance of being included, resulting in non-representative samples for estimating the true average daily step count.
b. Since these samples are not representative of the entire population, is it wise to use the results to describe the entire population?
Solution
b. No. For these samples, each member of the population did not have an equally likely chance of being chosen.
Now, suppose we take a third sample. We choose ten different part-time students who are taking classes across several different university colleges or academic departments: Kinesiology, Mathematics, English, Psychology, Public Health, History, Nursing, Business, Art, and Education. (We assume that these are the disciplines in which part-time students are enrolled and that an equal number of part-time students are enrolled in each discipline.) Each student is chosen using simple random sampling within their department to ensure a diverse representation. The students report the following average daily step counts:
7,800; 8,500; 7,200; 9,100; 10,500; 6,500; 9,500; 8,200; 7,500; 9,300
Problem
c. Is the sample biased?
Solution
c. This sample is likely less biased than Samples 1 and 2. By using a form of stratified sampling (taking one random student from each major/discipline), the researcher has helped the sample reflect the academic diversity of the entire part-time student population. The goal of using this diverse, probability-based approach is to minimize the bias seen in the previous samples, where students were selected based on high-activity classes (Sample 1) or low-activity demographics (Sample 2). The resulting average step count of this sample (approximately 8,410 steps) is a much more trustworthy estimate of the population average.
Students often ask if it is "good enough" to take a sample, instead of surveying the entire population. If the survey uses probability sampling methods and is done well, the answer is yes.
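The averages of the three samples can be checked directly from the reported step counts. A short sketch using Python's `statistics.mean`:

```python
from statistics import mean

# Daily step counts reported in each sample.
sample_1 = [12500, 11800, 14100, 10900, 13500, 15200, 11500, 13800, 10500, 12900]
sample_2 = [4500, 3200, 5100, 3900, 4800, 6000, 3500, 4200, 3100, 5800]
sample_3 = [7800, 8500, 7200, 9100, 10500, 6500, 9500, 8200, 7500, 9300]

means = {
    "Sample 1 (convenience)": mean(sample_1),
    "Sample 2 (senior citizens)": mean(sample_2),
    "Sample 3 (one student per discipline)": mean(sample_3),
}
for name, value in means.items():
    print(f"{name}: {value:,.0f} steps")
```

The gap of several thousand steps between the first two averages shows how strongly the choice of sampling method can shift an estimate.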
Variation in Data
Variation is present in any set of data. For example, repeated measurements of systolic blood pressure (SBP) rarely produce exactly the same value. In one study, eight SBP measurements produced the following readings (in mmHg):
118; 125; 115; 122; 119; 120; 117; 124
Measurements of systolic blood pressure may vary because different people make the measurements (e.g., observer error) or because the subject’s exact physiological state (e.g., recent activity, stress) was not the same at the time of each measurement. Health researchers regularly run tests to determine if the SBP readings in a study group fall within the desired range or exhibit a concerning level of fluctuation.
Be aware that as you take data, your data may vary somewhat from the data someone else is taking for the same purpose. This is completely natural. However, if two or more of you are taking the same data and get very different results, it is time for you and the others to reevaluate your data-taking methods and your accuracy.
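The variation in the eight measurements listed above can be summarized with a few simple quantities. This sketch reports only the mean, the smallest and largest values, and the range between them.

```python
from statistics import mean

# The eight measurements listed above.
measurements = [118, 125, 115, 122, 119, 120, 117, 124]

center = mean(measurements)
lowest, highest = min(measurements), max(measurements)
value_range = highest - lowest

print(f"mean = {center}, min = {lowest}, max = {highest}, range = {value_range}")
```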
Variation in Samples
It was mentioned previously that two or more samples from the same population, taken randomly, and having close to the same characteristics of the population will likely be different from each other. Suppose Doreen and Jung both decide to study the average amount of time students at their college sleep each night. Doreen and Jung each take samples of 500 students. Doreen uses systematic sampling and Jung uses cluster sampling. Doreen's sample will be different from Jung's sample. Even if Doreen and Jung used the same sampling method, in all likelihood their samples would be different. Neither would be wrong, however.
Think about what contributes to making Doreen’s and Jung’s samples different.
If Doreen and Jung took larger samples (i.e., the number of data values is increased), their sample results (the average amount of time a student sleeps) might be closer to the actual population average. But still, their samples would be, in all likelihood, different from each other. This variability in samples cannot be stressed enough.
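Sampling variability is easy to see in a small simulation. Everything about the population below is hypothetical: 10,000 students whose nightly sleep is generated from a normal distribution with mean 7.0 hours and standard deviation 1.2 hours. For simplicity both researchers draw simple random samples here, rather than the systematic and cluster samples described above.

```python
import random
from statistics import mean

rng = random.Random(6)  # fixed seed so the illustration is repeatable

# Hypothetical population: nightly sleep (hours) for 10,000 students.
population = [rng.gauss(7.0, 1.2) for _ in range(10_000)]
population_mean = mean(population)


def sample_mean(n):
    """Mean of a simple random sample of size n drawn from the population."""
    return mean(rng.sample(population, n))


doreen = sample_mean(500)
jung = sample_mean(500)

print(f"population mean: {population_mean:.2f}")
print(f"Doreen's sample: {doreen:.2f}")
print(f"Jung's sample:   {jung:.2f}")

# Smaller samples tend to wander further from the population mean.
print([round(sample_mean(10), 2) for _ in range(5)])
```

Running the sketch with different seeds gives different sample means each time, yet with 500 observations they stay close to the population mean.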
Size of a Sample
The size of a sample (often called the number of observations, usually given the symbol n) is important. The examples you have seen in this book so far have been small. Samples of only a few hundred observations, or even smaller, are sufficient for many purposes. In polling, samples that are from 1,200 to 1,500 observations are considered large enough and good enough if the survey is random and is well done. Later we will find that even much smaller sample sizes will give very good results. You will learn why when you study confidence intervals.
Be aware that many large samples are biased. For example, call-in surveys are invariably biased, because people choose to respond or not.


