Jason Greenberg

Research Question: Starting as early as their undergraduate career, do women face barriers to entry into or have strong preferences against entering STEM related fields, specifically the sciences and engineering, due to their gender? Also, does region within the United States impact this gender disparity?

Changing gender expectations and increasingly equal societal treatment of women have led researchers from several disciplines to analyze what may contribute to the lingering gender gap in earned wages. Many factors influence a worker's salary, including job industry, experience, inherent ability, lifestyle preferences, and performance. Many of these predictors are subjective and difficult to measure. This presentation will not focus on explaining the causes of higher median incomes for men, but will instead examine the gender disparity in college major choice, which is an indicator of eventual career track and earnings. Systematic gender preferences for and against science and engineering majors are visible in undergraduate major data from the 2015 American Community Survey. Moreover, while the gap between the number of male and female science and engineering majors is strikingly consistent across states, the percentage of male degree holders with science and engineering majors relative to the corresponding percentage for women suggests that different parts of the country face varying levels of gender disparity in the sciences at the undergraduate level.

In [1]:
library(ggplot2)
library(maps)
library(RColorBrewer)
Warning message:
"package 'maps' was built under R version 3.3.3"

The R packages above are needed to produce the graphics that support the argument developed below.

In [2]:
bachelors <- read.csv("bachelors.csv", header = TRUE, stringsAsFactors = FALSE)
 dim(bachelors)
 head(bachelors)
  1. 581
  2. 291
GEO.id GEO.id2 GEO.display.label HC01_EST_VC01 HC01_MOE_VC01 HC02_EST_VC01 HC02_MOE_VC01 HC03_EST_VC01 HC03_MOE_VC01 HC04_EST_VC01 ... HC02_EST_VC27 HC02_MOE_VC27 HC03_EST_VC27 HC03_MOE_VC27 HC04_EST_VC27 HC04_MOE_VC27 HC05_EST_VC27 HC05_MOE_VC27 HC06_EST_VC27 HC06_MOE_VC27
Id Id2 Geography Total; Estimate; Total population 25 years and over with a Bachelor's degree or higher Total; Margin of Error; Total population 25 years and over with a Bachelor's degree or higher Percent; Estimate; Total population 25 years and over with a Bachelor's degree or higher Percent; Margin of Error; Total population 25 years and over with a Bachelor's degree or higher Males; Estimate; Total population 25 years and over with a Bachelor's degree or higher Males; Margin of Error; Total population 25 years and over with a Bachelor's degree or higher Percent Males; Estimate; Total population 25 years and over with a Bachelor's degree or higher ... Percent; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others Percent; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others Males; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others Males; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others Percent Males; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others Percent Males; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others Females; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others Females; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others Percent Females; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others Percent Females; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others
0400000US01 1 Alabama 792876 14677 (X) (X) 366201 9531 (X) ... 17.8 1.6 14284 1696 16.4 1.8 14435 1926 19.4 2.4
0400000US02 2 Alaska 139416 4807 (X) (X) 67843 3222 (X) ... 18.9 3.6 1951 565 17.1 4.4 2090 580 20.9 5.2
0400000US04 4 Arizona 1257449 16239 (X) (X) 621477 10626 (X) ... 18.4 1 26996 2115 15.6 1.1 30518 2468 21.9 1.6
0400000US05 5 Arkansas 433381 7690 (X) (X) 198339 5631 (X) ... 15.8 1.7 6777 1168 14 2.4 7330 1299 17.9 2.6
0400000US06 6 California 8415690 37555 (X) (X) 4123037 23131 (X) ... 23.6 0.5 157982 5589 18.9 0.6 211503 6420 29.1 0.9

The "bachelors.csv" file includes information on bachelor's degree holders from 2015 for men and women across the United States and Puerto Rico in various geographical regions. As this was a dataset used for Problem Set 2 of the class, no major data cleaning was necessary. For the purposes of this presentation, only the combined state figures and not the urban, rural, or city specific data will be used.

In [3]:
desired_columns <- c(3, 4, 16, 28, 40, 52, 64)
desired_rows <- seq(2,53) #all states, Washington DC, and Puerto Rico
subsetTotal <- bachelors[desired_rows, desired_columns] 
colnames(subsetTotal) <- c("State","Total", "SciEng", "SciEngRelated", "Business", "Education", "HumArts")
dim(subsetTotal)
subsetTotal
  1. 52
  2. 7
State Total SciEng SciEngRelated Business Education HumArts
2Alabama 792876 232948 79424 182750 135842 161912
3Alaska 139416 50587 12854 22495 20385 33095
4Arizona 1257449 416610 119456 271892 183272 266219
5Arkansas 433381 126510 42437 95297 83480 85657
6California 8415690 3427565 674278 1586921 563581 2163345
7Colorado 1440776 553496 121176 286139 147138 332827
8Connecticut 948044 342674 80873 187022 100734 236741
9Delaware 201929 70332 21046 44771 28364 37416
10District of Columbia 268345 125167 11869 33555 11676 86078
11Florida 4092338 1283693 414601 995869 589792 808383
12Georgia 2000113 641664 178239 485769 273021 421420
13Hawaii 309194 110773 27619 61107 41200 68495
14Idaho 276912 92786 30241 50346 45560 57979
15Illinois 2853540 932408 269244 619650 384915 647323
16Indiana 1088120 305153 134069 226250 185666 236982
17Iowa 556591 169312 54940 116686 103090 112563
18Kansas 599063 175589 64756 129649 105871 123198
19Kentucky 696174 203689 78840 137156 119107 157382
20Louisiana 718058 200267 88487 144280 121677 163347
21Maine 289553 102732 28277 39879 45189 73476
22Maryland 1591614 645663 138044 295580 157419 354908
23Massachusetts 1951689 780836 158139 364241 175035 473438
24Michigan 1870473 609561 194451 397446 277764 391251
25Minnesota 1284007 429808 123550 252884 185440 292325
26Mississippi 406599 100307 55009 84900 87213 79170
27Missouri 1140860 342021 113391 246469 184485 254494
28Montana 216174 71556 23757 35787 39252 45822
29Nebraska 372288 103636 40424 84020 71497 72711
30Nevada 463681 147490 42672 105846 61003 106670
31New Hampshire 334313 123948 31664 63342 42482 72877
32New Jersey 2318073 854811 195024 527062 259221 481955
33New Mexico 364462 127327 32978 57214 58421 88522
34New York 4778463 1623429 415315 895311 548835 1295573
35North Carolina 1991057 687074 182823 402340 269287 449533
36North Dakota 143403 40867 20868 27786 27259 26623
37Ohio 2115116 650627 233599 448383 342930 439577
38Oklahoma 630004 182916 61146 143828 122057 120057
39Oregon 901667 347808 79655 131021 100430 242753
40Pennsylvania 2641023 874686 276665 530252 396210 563210
41Rhode Island 238818 80596 22368 45589 29037 61228
42South Carolina 890241 283678 85236 197700 131685 191942
43South Dakota 154885 47184 15976 29897 33840 27988
44Tennessee 1151080 342842 117853 256625 171712 262048
45Texas 4955374 1719782 445895 1164460 632224 993013
46Utah 554712 180989 55741 104661 77687 135634
47Vermont 162072 60132 12850 18942 22690 47458
48Virginia 2102044 863355 159477 397325 201524 480363
49Washington 1670893 685505 146243 274943 166411 397791
50West Virginia 254414 74797 32048 44434 51203 51932
51Wisconsin 1112458 340019 125112 225032 182504 239791
52Wyoming 102034 35634 10228 14729 21140 20303
53Puerto Rico 590228 146944 63591 190620 112350 76723

This dataframe presents the total number of bachelor's degrees for those aged 25 or older from the year 2015 for all 50 states, Washington DC, and Puerto Rico. Five categories of majors are included. The US Census Bureau American Community Survey defines science and engineering related majors to include nursing, architecture, and mathematics teacher education degrees, while the science and engineering category includes biology, chemistry, physics, mathematics, computer science, and social science degrees.

In [5]:
colnames(subsetTotal) <- c("State","Total", "SciEng", "SciEngRelated", "Business", "Education", "HumArts")
bandNames <- colnames(subsetTotal[,3:7])
par(mfrow = c(3,2))
par(mar = c(0,0,0,0))
 for(j in 1:5){ 
    hist(as.numeric(subsetTotal[,j+2])/as.numeric(subsetTotal[,2]),breaks = seq(0,0.7, by=0.05),ylim = c(0,50),
        axes = FALSE, main = "", xlab = "", ylab = "", col = "grey")
    box()
    text(x = .33, y=40, label = bandNames[j])
 }

Before analyzing gender and regional differences in college major selection, it is helpful to see the distribution of majors for the combined figures, which include both men and women, across all observations. The visual above includes five histograms, one for each category of college major. The x-axis measures the percentage of degree holders with that major and ranges from 0 to 70%, as defined by "breaks = seq(0,0.7, by=0.05)", while the y-axis indicates frequency as a number of states and ranges from 0 to 50. Each bar represents a bin 5 percentage points wide. The "par" calls and the for loop generate the grouped set of histograms, and each histogram's label is taken from the column names of the original dataframe "subsetTotal" through the "colnames" function. In these state-level figures for all degree holders aged 25 and over, science and engineering majors account for the highest percentage of total degrees: more than 20 states fall in the bin around 30% of all degrees, and the distribution is relatively even and bell shaped. By contrast, science and engineering related fields, where about 30 states have between 5 and 10 percent of degree holders, and education, where about 20 states have between 10 and 15 percent, are less popular.
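As a rough illustration of how each histogram bins the state shares, the sketch below recomputes the SciEng share for the first six states listed in "subsetTotal" and counts how many fall into each 5-point bin. The names "total", "sci_eng", "share", and "bin_counts" are introduced here for illustration only; the counts are copied from the table above, and six states are of course not the full 52 observations.

```r
# Counts copied from the first six rows of subsetTotal (Alabama to Colorado)
total   <- c(792876, 139416, 1257449, 433381, 8415690, 1440776)
sci_eng <- c(232948,  50587,  416610, 126510, 3427565,  553496)

# Share of all degree holders with a science or engineering major
share <- sci_eng / total

# Same bin edges as the histograms; plot = FALSE returns the bin counts only
bins <- hist(share, breaks = seq(0, 0.7, by = 0.05), plot = FALSE)

# One count per 5-point bin, labelled by the bin's upper edge
bin_counts <- setNames(bins$counts, bins$breaks[-1])
print(round(share, 3))
print(bin_counts)
```

Each state lands in exactly one bin, so the bin counts always sum to the number of states plotted.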

In [6]:
Desired_columnsMale <- c(8, 20, 32, 44, 56, 68) #men totals
Desired_rowsMale <- seq(2,53) #all states
SubsetMale <- bachelors[Desired_rowsMale, Desired_columnsMale] 
colnames(SubsetMale) <- c("TotalMale", "SciEngMale", "SciEngRelatedMale", "BusinessMale", 
                          "EducationMale", "HumArtsMale")
bandNames <- colnames(SubsetMale[,-1])
par(mfrow = c(3,2))
par(mar = c(0,0,0,0))
 for(j in 1:5){ 
    hist(as.numeric(SubsetMale[,j+1])/as.numeric(SubsetMale[,1]),breaks = seq(0,0.7, by=0.05),ylim = c(0,50),
        axes = FALSE, main = "", xlab = "", ylab = "", col = "grey")
    box()
    text(x = .33, y=40, label = bandNames[j])
 }

Desired_columnsFemale <- c(12, 24, 36, 48, 60, 72) #women totals
Desired_rowsFemale <- seq(2,53) #all states
SubsetFemale <- bachelors[Desired_rowsFemale, Desired_columnsFemale] 
colnames(SubsetFemale) <- c("TotalFemale", "SciEngFemale", "SciEngRelatedFemale", "BusinessFemale", 
                            "EducationFemale", "HumArtsFemale")
bandNames <- colnames(SubsetFemale[,-1])
par(mfrow = c(3,2))
par(mar = c(0,0,0,0))
 for(j in 1:5){ 
    hist(as.numeric(SubsetFemale[,j+1])/as.numeric(SubsetFemale[,1]),breaks = seq(0,0.7, by=0.05),ylim = c(0,50),
        axes = FALSE, main = "", xlab = "", ylab = "", col = "grey")
    box()
    text(x = .33, y=40, label = bandNames[j])
 }

These two sets of five histograms, divided by gender, display the differences in the frequencies of major choice for men and women. The same coding technique and structure were used as in the non-gender-specific set of histograms above. This time, the new subsets "SubsetMale" and "SubsetFemale" were used instead of "subsetTotal," with the data drawn from the gender-specific columns of the original CSV file. One of the most drastic differences appears in the science and engineering histograms: the center of the male SciEng distribution is about 20 percentage points higher than the center of the female SciEng distribution. No other major category shows a gap this large in favor of men. Further statistical analysis will help clarify the relationship between the number of men and the number of women with science and engineering degrees.

In [128]:
desired_columns <- c(3,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72)
desired_rows <- seq(2,53) #all states
subset <- bachelors[desired_rows, desired_columns] 
colnames(subset) <- c("State","Total","TotalMale","TotalFemale",
                      "SciEngTotal", "SciEngMale", "SciEngFemale",
                      "SciEngRelatedTotal", "SciEngRelatedMale", "SciEngRelatedFemale",
                      "BusinessTotal","BusinessMale","BusinessFemale",
                      "EducationTotal","EducationMale","EducationFemale",
                     "HumanitiesTotal","HumanitiesMale","HumanitiesFemale")

subset$percentMaleTotal <- as.numeric(subset$TotalMale)/as.numeric(subset$Total)*100
subset$percentFemaleTotal <- as.numeric(subset$TotalFemale)/as.numeric(subset$Total)*100

subset$percentOfMaleInSciEng <- as.numeric(subset$SciEngMale)/as.numeric(subset$SciEngTotal)*100     #this pair of percentages will sum to 100%
subset$percentOfFemaleInSciEng <- as.numeric(subset$SciEngFemale)/as.numeric(subset$SciEngTotal)*100

subset$percentSciEngMale <- as.numeric(subset$SciEngMale)/as.numeric(subset$TotalMale)*100           #while this pair has no immediate, direct relationship
subset$percentSciEngFemale <- as.numeric(subset$SciEngFemale)/as.numeric(subset$TotalFemale)*100

subset$percentSciEngRelatedMale <- as.numeric(subset$SciEngRelatedMale)/as.numeric(subset$TotalMale)*100          
subset$percentSciEngRelatedFemale <- as.numeric(subset$SciEngRelatedFemale)/as.numeric(subset$TotalFemale)*100

subset$percentBusinessMale <- as.numeric(subset$BusinessMale)/as.numeric(subset$TotalMale)*100
subset$percentBusinessFemale <- as.numeric(subset$BusinessFemale)/as.numeric(subset$TotalFemale)*100

subset$percentEducationMale <- as.numeric(subset$EducationMale)/as.numeric(subset$TotalMale)*100
subset$percentEducationFemale <- as.numeric(subset$EducationFemale)/as.numeric(subset$TotalFemale)*100

subset$percentHumanitiesMale <- as.numeric(subset$HumanitiesMale)/as.numeric(subset$TotalMale)*100
subset$percentHumanitiesFemale <- as.numeric(subset$HumanitiesFemale)/as.numeric(subset$TotalFemale)*100

subset$SciEngRatio <- subset$percentSciEngFemale/subset$percentSciEngMale

head(subset)
State Total TotalMale TotalFemale SciEngTotal SciEngMale SciEngFemale SciEngRelatedTotal SciEngRelatedMale SciEngRelatedFemale ... percentSciEngFemale percentSciEngRelatedMale percentSciEngRelatedFemale percentBusinessMale percentBusinessFemale percentEducationMale percentEducationFemale percentHumanitiesMale percentHumanitiesFemale SciEngRatio
2Alabama 792876 366201 426675 232948 146493 86455 79424 20420 59004 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
3Alaska 139416 67843 71573 50587 29875 20712 12854 3428 9426 ... 28.93829 5.052843 13.16977 18.57819 13.81946 9.125481 19.83150 23.20799 24.24098 0.6571582
4Arizona 1257449 621477 635972 416610 262739 153871 119456 34583 84873 ... 24.19462 5.564647 13.34540 25.39499 17.93601 7.067840 21.91087 19.69598 22.61310 0.5722941
5Arkansas 433381 198339 235042 126510 78847 47663 42437 9224 33213 ... 20.27850 4.650623 14.13067 27.88357 17.01526 8.265142 28.54256 19.44701 20.03302 0.5101041
6California 8415690 4123037 4292653 3427565 2040615 1386950 674278 209270 465008 ... 32.30986 5.075628 10.83265 20.86023 16.93233 3.194441 10.06075 21.37669 29.86442 0.6528166
7Colorado 1440776 705075 735701 553496 331019 222477 121176 36832 84344 ... 30.24014 5.223841 11.46444 23.05031 16.80261 5.171365 15.04361 19.60642 26.44920 0.6441191

Dividing the number of male and female degree holders in each major by the total number of degree holders of that gender yields a percentage for each state, major, and gender. Having access to both raw counts and relative figures is important for a more complete analysis. To clarify the annotations on the second and third pairs of calculations above: "percentOfMaleInSciEng" and "percentOfFemaleInSciEng" give each gender's share of all science and engineering degree holders, which is why the two values sum to 100%. Meanwhile, "percentSciEngMale" and "percentSciEngFemale" give the share of each gender's degree holders who majored in science and engineering rather than another field, so these two percentages will most likely not sum to 100%. Both computations are important for understanding the relationship between men, women, and degree choice.
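The difference between the two pairs of percentages can be checked by hand for a single state. The sketch below uses the Alabama counts printed in the head(subset) output; the lower-case variable names are illustrative and are not columns of the notebook's dataframe.

```r
# Alabama counts copied from the head(subset) output above
total_male    <- 366201
total_female  <- 426675
scieng_total  <- 232948
scieng_male   <- 146493
scieng_female <- 86455

# Share of SciEng degrees held by each gender: these two sum to 100
pct_of_male_in_scieng   <- scieng_male   / scieng_total * 100
pct_of_female_in_scieng <- scieng_female / scieng_total * 100

# Share of each gender's degrees that are SciEng: no reason to sum to 100
pct_scieng_male   <- scieng_male   / total_male   * 100
pct_scieng_female <- scieng_female / total_female * 100

# Female-to-male ratio of the within-gender shares (SciEngRatio in the notebook)
scieng_ratio <- pct_scieng_female / pct_scieng_male

print(c(pct_of_male_in_scieng, pct_of_female_in_scieng))
print(c(pct_scieng_male, pct_scieng_female, scieng_ratio))
```

The last two values should agree with the "percentSciEngFemale" and "SciEngRatio" entries in Alabama's row of the table.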

In [8]:
median(as.numeric(subset$TotalMale))
median(as.numeric(subset$TotalFemale))

median(as.numeric(subset$SciEngMale))
median(as.numeric(subset$SciEngFemale))
342900.5
412566.5
134348.5
84527

Comparing the median number of total degree holders per state for men and women shows that the state median is higher for women than for men. The ratio of the female median to the male median is about 1.2 to 1.0, while the ratio of the female to the male median for science and engineering degree holders is about 1.0 to 1.59. Even with higher female medians overall, the state median for men with science and engineering degrees is much higher than the state median for women.

In [9]:
sum(as.numeric(subset$TotalMale))
sum(as.numeric(subset$TotalFemale))

sum(as.numeric(subset$SciEngMale))
sum(as.numeric(subset$SciEngFemale))
31870667
34961114
13925554
9244229

Computing the same ratios for all state counts summed together, the ratio of women to men among all degree holders is about 1.1 to 1.0, while the ratio of women to men among science and engineering degree holders is about 1.0 to 1.5. Therefore, for both the state medians and the overall sums, women hold more degrees in general, but the relative gap in science and engineering degrees is even larger and runs in the opposite direction.
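The ratios quoted for the medians and the sums can be reproduced directly from the printed outputs. This is a small arithmetic sketch; the vector names are hypothetical and the numbers are copied from the two output blocks above.

```r
# Values copied from the median() and sum() outputs above
median_total  <- c(male = 342900.5, female = 412566.5)
median_scieng <- c(male = 134348.5, female = 84527)
sum_total     <- c(male = 31870667, female = 34961114)
sum_scieng    <- c(male = 13925554, female = 9244229)

# Women per man for all degrees, men per woman for SciEng degrees
ratios <- c(
  women_per_man_median_total  = median_total[["female"]] / median_total[["male"]],
  men_per_woman_median_scieng = median_scieng[["male"]] / median_scieng[["female"]],
  women_per_man_sum_total     = sum_total[["female"]] / sum_total[["male"]],
  men_per_woman_sum_scieng    = sum_scieng[["male"]] / sum_scieng[["female"]]
)
print(round(ratios, 2))
```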

In [10]:
median(subset$percentMaleTotal)
median(subset$percentFemaleTotal)

median(subset$percentOfMaleInSciEng )
median(subset$percentOfFemaleInSciEng)

median(subset$percentSciEngMale)
median(subset$percentSciEngFemale)
median(subset$percentSciEngMale)/median(subset$percentSciEngFemale)

median(subset$percentSciEngRelatedMale)/median(subset$percentSciEngRelatedFemale) 

median(subset$percentBusinessMale)/median(subset$percentBusinessFemale) 

median(subset$percentEducationMale)/median(subset$percentEducationFemale) 

median(subset$percentHumanitiesMale)/median(subset$percentHumanitiesFemale)
47.5113516955763
52.4886483044237
60.6014516113314
39.3985483886686
42.3009988693081
24.1244107783942
1.75345210533361
0.413230746804793
1.45666628645974
0.344464288170358
0.858303537506162

Looking at the percentage calculations, in the median state women held about 52.5% of all bachelor's degrees in 2015 among those aged 25 and over, yet they held only about 39.4% of science and engineering degrees. In line with the ratios examined above, men had a stronger inclination to go into the sciences and engineering than women. In the median state, 42.3% of men's degrees were in science and engineering, compared with only 24.1% of women's degrees. The ratio of those percentages was 1.75, the largest of all the major categories, with business second at 1.46. The discrepancies in total degree counts, state medians, and state median percentages all point toward some societal trend for women not to enter the sciences or engineering. One paper from 2009, which analyzed a survey of 161 students from Northwestern University, determined through econometric modeling that the most significant reason women decided not to enter particular majors was linked to expectations of enjoyment of the coursework. The author suggested that the difference in expectations between men and women for particular department courses might be linked to gender discrimination in society (Zafar 29). In an article from 1984 written using data from the National Longitudinal Studies of the High School Class of 1972, the authors argued that "substantial differences appear in their preferences as of [the students'] senior year in high school, for various types of work and in their subsequent preparation for the labor market during college" (Daymont 414). Again, an economic regression model based on survey answers determined which factors contributed most to the gender gap.
If students' preferences and impressions of career choice shape their college major selection, and therefore their career path and earnings, then the 2015 American Community Survey data support the idea that the gender gap begins to take form even before students finish their education.

In [12]:
options(scipen=2000000) #large penalty against scientific notation, so numbers print as regular decimals
summary(lm(as.numeric(subset$SciEngFemale) ~ as.numeric(subset$SciEngMale)))
Call:
lm(formula = as.numeric(subset$SciEngFemale) ~ as.numeric(subset$SciEngMale))

Residuals:
   Min     1Q Median     3Q    Max 
-78668  -9063   -421   6422  97610 

Coefficients:
                                  Estimate   Std. Error t value
(Intercept)                   -3391.997658  4208.916304  -0.806
as.numeric(subset$SciEngMale)     0.676498     0.009741  69.447
                                         Pr(>|t|)    
(Intercept)                                 0.424    
as.numeric(subset$SciEngMale) <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23820 on 50 degrees of freedom
Multiple R-squared:  0.9897,	Adjusted R-squared:  0.9895 
F-statistic:  4823 on 1 and 50 DF,  p-value: < 0.00000000000000022

The code above outputs a linear regression summary in which the raw count of female-held science and engineering degrees is regressed on the raw count of male-held science and engineering degrees for each state, DC, and Puerto Rico. The same summary is then produced for the four other major categories. The summary statistics that support the claim that women systematically avoid majoring in science or engineering are analyzed further on.

In [13]:
summary(lm(as.numeric(subset$SciEngRelatedFemale) ~ as.numeric(subset$SciEngRelatedMale)))
Call:
lm(formula = as.numeric(subset$SciEngRelatedFemale) ~ as.numeric(subset$SciEngRelatedMale))

Residuals:
   Min     1Q Median     3Q    Max 
-30447  -5203  -2189   6499  34268 

Coefficients:
                                       Estimate Std. Error t value
(Intercept)                          7139.04527 2119.41889   3.368
as.numeric(subset$SciEngRelatedMale)    2.32544    0.04122  56.413
                                                 Pr(>|t|)    
(Intercept)                                       0.00146 ** 
as.numeric(subset$SciEngRelatedMale) < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11470 on 50 degrees of freedom
Multiple R-squared:  0.9845,	Adjusted R-squared:  0.9842 
F-statistic:  3182 on 1 and 50 DF,  p-value: < 0.00000000000000022
In [14]:
summary(lm(as.numeric(subset$BusinessFemale) ~ as.numeric(subset$BusinessMale)))
Call:
lm(formula = as.numeric(subset$BusinessFemale) ~ as.numeric(subset$BusinessMale))

Residuals:
   Min     1Q Median     3Q    Max 
-42580  -3227    412   2940  56153 

Coefficients:
                                   Estimate  Std. Error t value
(Intercept)                     -1341.27859  2532.33558   -0.53
as.numeric(subset$BusinessMale)     0.80450     0.01122   71.68
                                           Pr(>|t|)    
(Intercept)                                   0.599    
as.numeric(subset$BusinessMale) <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13810 on 50 degrees of freedom
Multiple R-squared:  0.9904,	Adjusted R-squared:  0.9902 
F-statistic:  5138 on 1 and 50 DF,  p-value: < 0.00000000000000022
In [15]:
summary(lm(as.numeric(subset$EducationFemale) ~ as.numeric(subset$EducationMale)))
Call:
lm(formula = as.numeric(subset$EducationFemale) ~ as.numeric(subset$EducationMale))

Residuals:
   Min     1Q Median     3Q    Max 
-62856  -8567  -1395   7101  82946 

Coefficients:
                                    Estimate  Std. Error t value
(Intercept)                      -4432.65497  4627.01232  -0.958
as.numeric(subset$EducationMale)     3.35557     0.08909  37.666
                                            Pr(>|t|)    
(Intercept)                                    0.343    
as.numeric(subset$EducationMale) <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22270 on 50 degrees of freedom
Multiple R-squared:  0.966,	Adjusted R-squared:  0.9653 
F-statistic:  1419 on 1 and 50 DF,  p-value: < 0.00000000000000022
In [16]:
summary(lm(as.numeric(subset$HumanitiesFemale) ~ as.numeric(subset$HumanitiesMale)))
Call:
lm(formula = as.numeric(subset$HumanitiesFemale) ~ as.numeric(subset$HumanitiesMale))

Residuals:
   Min     1Q Median     3Q    Max 
-60060  -6516   2465   6738  59227 

Coefficients:
                                     Estimate  Std. Error t value
(Intercept)                       -8813.81804  2759.08699  -3.194
as.numeric(subset$HumanitiesMale)     1.39733     0.01402  99.692
                                              Pr(>|t|)    
(Intercept)                                    0.00243 ** 
as.numeric(subset$HumanitiesMale) < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15380 on 50 degrees of freedom
Multiple R-squared:  0.995,	Adjusted R-squared:  0.9949 
F-statistic:  9938 on 1 and 50 DF,  p-value: < 0.00000000000000022

The relationship between the number of female degree holders and the number of male degree holders is positive for each of the five major categories. Given the raw counts of bachelor's degree holders, this result is not surprising: states with more degree holders of one gender tend to have more of the other as well. However, the smallest slope belongs to science and engineering at 0.68, while the other slopes are 0.80 (business), 1.40 (humanities and arts), 2.33 (science and engineering related), and 3.36 (education). Because the p-value of every slope coefficient is negligibly close to zero, all five slopes are statistically significant, although three of the five intercepts are not.
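The five separate regression calls could also be written as one loop that keeps only the slopes. The sketch below is a minimal version under stated assumptions: it uses just the six states and the two major categories visible in the head(subset) output, and "six_states", "majors", and "slopes" are hypothetical names. Its slopes will therefore not match the full 52-row estimates reported above.

```r
# First six states from head(subset); the full data would use all 52 rows
six_states <- data.frame(
  SciEngMale          = c(146493, 29875, 262739, 78847, 2040615, 331019),
  SciEngFemale        = c(86455, 20712, 153871, 47663, 1386950, 222477),
  SciEngRelatedMale   = c(20420, 3428, 34583, 9224, 209270, 36832),
  SciEngRelatedFemale = c(59004, 9426, 84873, 33213, 465008, 84344)
)

majors <- c("SciEng", "SciEngRelated")

# Fit female count ~ male count for each major and keep only the slope
slopes <- vapply(majors, function(m) {
  f <- reformulate(paste0(m, "Male"), response = paste0(m, "Female"))
  coef(lm(f, data = six_states))[[2]]
}, numeric(1))

print(slopes)
```

Even on this small sample the pattern has the same direction as in the full regressions: the SciEng slope is below one and the SciEngRelated slope is above one.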

In [17]:
ggplot(subset, aes(as.numeric(SciEngMale), as.numeric(SciEngFemale)))  +
      geom_point()+
      #scale_x_continuous(name="Total Male SciEng Degree Holders", limits=c(0, 150000)) +
      #scale_y_continuous(name="Total Female SciEng Degree Holders", limits=c(0, 150000))+
      labs(x= "Total Male SciEng Degree Holders by State") +
      labs(y = "Total Female SciEng Degree Holders by State")+ ylim(0,2000000)+
      labs(title= "Relationship Between Male and Female SciEng Degree Holders") + 
      stat_smooth(method = lm, se = FALSE, color = "black") +
      geom_vline(xintercept = 134349, linetype="dotted", colour="red")+
      geom_hline(yintercept =  84527, linetype="dotted", colour="red")+
      geom_vline(xintercept = 0)+
      geom_hline(yintercept = 0)+
      annotate("text", label = "r^2 == 0.9895", parse = TRUE,x= 1400000, y = 1500000) +
      annotate("text", label = "slope = 0.676498", x= 1475000, y = 1250000)



#qplot(as.numeric(SciEngMale), as.numeric(SciEngFemale), data = subset, color = I("darkblue"),
#      xlab = "Total Male SciEng Degree Holders", ylab = "Total Female SciEng Degree Holders", 
#      main = "Relationship Between Total Male and Female SciEng Degree Holders") + geom_smooth(method = "lm", se = FALSE)
#qplot version of the above ggplot

The ggplot above is a visual representation of the first of the five linear regressions run earlier. The x-axis represents the total number of male science or engineering bachelor’s degree holders aged twenty-five and older in each of the 50 states, DC, and Puerto Rico. The y-axis represents the same figure for women. The median number of male degree holders with a major in science or engineering was 134,349, while the median for women was 84,527; the intersection of the dotted red lines marks this point. The r-squared value indicates that almost ninety-nine percent of the variability in the number of female science and engineering degree holders is accounted for by variability in the number of male science or engineering degree holders. This value is very high for a regression with just one independent variable, but the regressions on the other major categories show similarly high r-squared values. Barring distortions in the data, this indicates that the number of male degree holders in a state for a particular major in 2015 was a very precise indicator of the number of female degree holders for that major.
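For a regression with a single predictor, the r-squared value is simply the squared correlation between the two variables, which is one way to read the "variability accounted for" statement. The sketch below checks this on the six states shown in the head(subset) output; the vectors "male" and "female" are illustrative, and the result is not the full-data value annotated on the plot.

```r
# SciEng counts for the first six states in head(subset)
male   <- c(146493, 29875, 262739, 78847, 2040615, 331019)
female <- c(86455, 20712, 153871, 47663, 1386950, 222477)

# Simple linear regression of female counts on male counts
fit <- lm(female ~ male)

# With one predictor, R-squared equals the squared Pearson correlation
r_squared <- summary(fit)$r.squared
print(r_squared)
print(cor(male, female)^2)
```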

In [18]:
summary(lm(subset$percentSciEngFemale ~ subset$percentSciEngMale))
ggplot(subset, aes(percentSciEngMale, percentSciEngFemale))  +
      geom_point()+
      labs(x= "Percentage of Male SciEng Degree Holders by State") + xlim(32.5,52.5)+ 
      labs(y = "Percentage of Female SciEng Degree Holders by State")+ ylim(15.25,35.25)+
      labs(title= "Relationship Between Male and Female SciEng Degree Holders by Percentage") + 
      stat_smooth(method = lm, se = FALSE, color = "black")
Call:
lm(formula = subset$percentSciEngFemale ~ subset$percentSciEngMale)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.9425 -1.5519  0.5676  1.1299  7.3309 

Coefficients:
                          Estimate Std. Error t value            Pr(>|t|)    
(Intercept)              -17.19788    3.77537  -4.555 0.00003383556251666 ***
subset$percentSciEngMale   0.99056    0.08823  11.227 0.00000000000000284 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.589 on 50 degrees of freedom
Multiple R-squared:  0.716,	Adjusted R-squared:  0.7103 
F-statistic: 126.1 on 1 and 50 DF,  p-value: 0.000000000000002836
Warning message:
"Removed 1 rows containing non-finite values (stat_smooth)."
Warning message:
"Removed 1 rows containing missing values (geom_point)."

Looking at the percentage of degree holders who majored in science and engineering in each state, as opposed to the raw counts, the slope is close to one-to-one: the regression above estimates it at about 0.99. This may at first appear to conflict with the interpretation of the raw counts, but because the two variables are measured differently, the slope actually supports the notion that the share of women in science and engineering is substantially lower than that of men. For every one percentage point increase in the share of male bachelor's degree holders who majored in science and engineering, the share of female degree holders with such a major is expected to increase by about one percentage point as well. However, the percentages for women, as seen on the y-axis, sit consistently lower, by roughly the 18 percentage point difference between the median state values for the two genders (42.3% for men and 24.1% for women). Importantly, a roughly constant gap in percentage points means that states where both men and women have high shares of science degrees have more similar shares in relative terms than states with lower figures, because subtracting the same gap from a larger percentage leaves a ratio closer to one. This will be more visually apparent later on, when the ratio of these two variables is mapped by state with a choropleth.
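The claim that a constant gap implies a ratio closer to one at higher percentages can be seen by plugging a few values into the fitted line. In the sketch below the intercept and slope are copied from the regression output above, while the male percentages in "male_pct" are hypothetical round numbers chosen to span the plotted x-axis range.

```r
# Coefficients copied from the regression output above
intercept <- -17.19788
slope     <- 0.99056

# Hypothetical male percentages spanning the plotted x-axis range
male_pct <- c(35, 40, 45, 50)

# Fitted female percentage and the implied female-to-male ratio
female_pct <- intercept + slope * male_pct
ratio <- female_pct / male_pct

print(data.frame(male_pct,
                 female_pct = round(female_pct, 2),
                 ratio = round(ratio, 3)))
```

Along the fitted line the ratio rises from about 0.50 at a male share of 35% to about 0.65 at 50%, even though the gap in percentage points barely changes.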

Initializing state longitude and latitude data and creating choropleth graphs will help to get a better sense of the regional implications of these findings.

In [19]:
states <- map_data("state")
head(states)
dim(states)
head(subset)
long       lat       group  order  region   subregion
-87.46201  30.38968  1      1      alabama  NA
-87.48493  30.37249  1      2      alabama  NA
-87.52503  30.37249  1      3      alabama  NA
-87.53076  30.33239  1      4      alabama  NA
-87.57087  30.32665  1      5      alabama  NA
-87.58806  30.32665  1      6      alabama  NA
  1. 15537
  2. 6
  State Total TotalMale TotalFemale SciEngTotal SciEngMale SciEngFemale SciEngRelatedTotal SciEngRelatedMale SciEngRelatedFemale ... percentSciEngFemale percentSciEngRelatedMale percentSciEngRelatedFemale percentBusinessMale percentBusinessFemale percentEducationMale percentEducationFemale percentHumanitiesMale percentHumanitiesFemale SciEngRatio
2 Alabama 792876 366201 426675 232948 146493 86455 79424 20420 59004 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
3 Alaska 139416 67843 71573 50587 29875 20712 12854 3428 9426 ... 28.93829 5.052843 13.16977 18.57819 13.81946 9.125481 19.83150 23.20799 24.24098 0.6571582
4 Arizona 1257449 621477 635972 416610 262739 153871 119456 34583 84873 ... 24.19462 5.564647 13.34540 25.39499 17.93601 7.067840 21.91087 19.69598 22.61310 0.5722941
5 Arkansas 433381 198339 235042 126510 78847 47663 42437 9224 33213 ... 20.27850 4.650623 14.13067 27.88357 17.01526 8.265142 28.54256 19.44701 20.03302 0.5101041
6 California 8415690 4123037 4292653 3427565 2040615 1386950 674278 209270 465008 ... 32.30986 5.075628 10.83265 20.86023 16.93233 3.194441 10.06075 21.37669 29.86442 0.6528166
7 Colorado 1440776 705075 735701 553496 331019 222477 121176 36832 84344 ... 30.24014 5.223841 11.46444 23.05031 16.80261 5.171365 15.04361 19.60642 26.44920 0.6441191
In [20]:
names(subset) <- tolower(names(subset))
subset$region <- tolower(subset$state)
head(subset)
  state total totalmale totalfemale sciengtotal sciengmale sciengfemale sciengrelatedtotal sciengrelatedmale sciengrelatedfemale ... percentsciengrelatedmale percentsciengrelatedfemale percentbusinessmale percentbusinessfemale percenteducationmale percenteducationfemale percenthumanitiesmale percenthumanitiesfemale sciengratio region
2 Alabama 792876 366201 426675 232948 146493 86455 79424 20420 59004 ... 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188 alabama
3 Alaska 139416 67843 71573 50587 29875 20712 12854 3428 9426 ... 5.052843 13.16977 18.57819 13.81946 9.125481 19.83150 23.20799 24.24098 0.6571582 alaska
4 Arizona 1257449 621477 635972 416610 262739 153871 119456 34583 84873 ... 5.564647 13.34540 25.39499 17.93601 7.067840 21.91087 19.69598 22.61310 0.5722941 arizona
5 Arkansas 433381 198339 235042 126510 78847 47663 42437 9224 33213 ... 4.650623 14.13067 27.88357 17.01526 8.265142 28.54256 19.44701 20.03302 0.5101041 arkansas
6 California 8415690 4123037 4292653 3427565 2040615 1386950 674278 209270 465008 ... 5.075628 10.83265 20.86023 16.93233 3.194441 10.06075 21.37669 29.86442 0.6528166 california
7 Colorado 1440776 705075 735701 553496 331019 222477 121176 36832 84344 ... 5.223841 11.46444 23.05031 16.80261 5.171365 15.04361 19.60642 26.44920 0.6441191 colorado

The above code renames the columns of the original "subset" data in lowercase and adds a final column named "region" that holds each state's name in lowercase, matching the "region" column of the state map data.

In [21]:
choro_df <- merge(states, subset, by = "region") #merge(df1,df2,by="column vector")
head(choro_df)
region long lat group order subregion state total totalmale totalfemale ... percentsciengfemale percentsciengrelatedmale percentsciengrelatedfemale percentbusinessmale percentbusinessfemale percenteducationmale percenteducationfemale percenthumanitiesmale percenthumanitiesfemale sciengratio
alabama -87.46201 30.38968 1 1 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.48493 30.37249 1 2 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.52503 30.37249 1 3 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.53076 30.33239 1 4 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.57087 30.32665 1 5 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.58806 30.32665 1 6 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188

Once both data frames share a matching column, they can be merged with the "merge" command and then sorted by the "order" column.
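
The merge-then-sort step can be illustrated on toy data. This is a minimal sketch using base R only: the data frames outline and degrees and all of their values are made-up stand-ins for the map data and the per-state table, not the real inputs.

```r
# Toy stand-in for the state outline data: several boundary points per state
outline <- data.frame(
  region = c("alabama", "alabama", "arizona", "arizona"),
  long   = c(-87.46, -87.48, -114.6, -114.7),   # illustrative coordinates
  lat    = c(30.39, 30.37, 35.0, 35.1),
  order  = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Toy stand-in for the per-state table: one row per state
degrees <- data.frame(
  State       = c("Arizona", "Alabama", "Alaska"),
  SciEngRatio = c(0.57, 0.51, 0.66),            # illustrative ratios
  stringsAsFactors = FALSE
)
degrees$region <- tolower(degrees$State)        # lowercase key shared with outline

# Default merge keeps only regions present in both data frames
merged <- merge(outline, degrees, by = "region")
merged <- merged[order(merged$order), ]         # restore polygon drawing order
print(merged)
```

Each state's values are repeated once per boundary point, which is why the merged output above shows identical Alabama figures on every row. In this sketch the Alaska row is dropped because it has no match in the toy outline data; with the default settings of "merge", any state missing from the map data would drop out of the real merge in the same way.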

The next two maps entitled "Women Degree Holders with a Major in Science/Engineering" and "Men Degree Holders with a Major in Science/Engineering" display the findings of the raw count statistical regression analysis conducted earlier. States with more men with science and engineering degrees also have more women with science and engineering degrees. The legend next to each will indicate the relative gap between the two genders in number of degrees. In a regional context, because of the strong direct connection between the two counts of science and engineering degrees, the maps look very similar.

In [22]:
choro <- choro_df[order(choro_df$order),]  #order by "order" column
head(choro)
region long lat group order subregion state total totalmale totalfemale ... percentsciengfemale percentsciengrelatedmale percentsciengrelatedfemale percentbusinessmale percentbusinessfemale percenteducationmale percenteducationfemale percenthumanitiesmale percenthumanitiesfemale sciengratio
alabama -87.46201 30.38968 1 1 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.48493 30.37249 1 2 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.52503 30.37249 1 3 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.53076 30.33239 1 4 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.57087 30.32665 1 5 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.58806 30.32665 1 6 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
In [23]:
choro$breaks <- cut(as.numeric(choro$sciengfemale),breaks = seq(0,1400000, by = 100000), include.lowest = TRUE, 
                    labels = c("0-100,000","100,001-200,000","200,001-300,000","300,001-400,000","400,001-500,000",
                              "500,001-600,000","600,001-700,000","700,001-800,000","800,001-900,000","900,001-1,000,000",
                              "1,000,001-1,100,000","1,100,001-1,200,000","1,200,001-1,300,000","1,300,001-1,400,000"))
                              
#choro$breaks <- cut(as.numeric(choro$sciengfemale),breaks = seq(0,1500000, by = 250000), include.lowest = TRUE, 
#                    labels = c("0-250,000","250,000-500,000","250,000-500,000",
#                              "500,000-750,000","750,000-1,000,000","1,000,000-1,250,000"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Women Degree Holders with a Major in Science/Engineering") + 
    scale_fill_brewer(name = "Number of Degrees", palette = "Reds")

For the raw degree count map for women above, bins of width 100,000 were used, while bins of width 200,000 are used for the men's map below. Even with the difference in intervals, the similar coloring of the two maps reflects the trend that every state has about 1.5 times as many science and engineering degrees held by men as by women.
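
How "cut" assigns a count to a bin can be seen on a few values. This is a minimal sketch: the three counts are the SciEngFemale figures for Alabama, Arizona, and California shown in the tables above, while the placeholder labels bin1 through bin14 are an assumption made for brevity.

```r
# SciEngFemale counts for Alabama, Arizona, and California from the tables above
counts <- c(86455, 153871, 1386950)

edges  <- seq(0, 1400000, by = 100000)                # 15 edges define 14 bins
labels <- paste0("bin", seq_len(length(edges) - 1))   # placeholder labels (assumed)

# Each bin is open on the left and closed on the right; include.lowest = TRUE
# additionally closes the first bin on the left so that 0 is not dropped
bins <- cut(counts, breaks = edges, include.lowest = TRUE, labels = labels)
print(as.character(bins))
```

The number of labels must be one fewer than the number of break points, which is why the women's map lists 14 labels for 15 edges.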

In [129]:
choro$breaks <- cut(as.numeric(choro$sciengmale),breaks = seq(0,2200000, by = 200000), include.lowest = TRUE, 
                    labels = c("0-200,000","200,001-400,000","400,001-600,000","600,001-800,000","800,001-1,000,000",
                              "1,000,001-1,200,000","1,200,001-1,400,000","1,400,001-1,600,000","1,600,001-1,800,000","1,800,001-2,000,000",
                              "2,000,001-2,200,000"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Men Degree Holders with a Major in Science/Engineering") + 
    scale_fill_brewer(name = "Number of Degrees", palette = "Blues")

While the raw counts suggest that there is no regional difference in degree counts for men and women in the sciences, looking at the percentage of degrees in each state that are in science or engineering paints a different picture. Comparing the two maps, some parts of the country rank high for men but not for women, and vice versa. Some states have high percentages for both, like California and New York, while others, like Wyoming and Florida, have high percentages for men but relatively lower percentages for women. Aside from the regional differences, the maps further the argument that more men are in the sciences than women. The degree rates are substantially lower for women, as shown earlier by the median percentages of men and women with science and engineering degrees.

In [25]:
choro$breaks <- cut(choro$percentsciengfemale,breaks = seq(15,45, by = 5), include.lowest = TRUE, 
                    labels = c("15%-20%","20%-25%","25%-30%",
                              "30%-35%","35%-40%","40%-45%"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Percentage of Women Degree Holders with a Major in Science/Engineering") + 
    scale_fill_brewer(name = "Degree Rates", palette = "Reds")
In [26]:
choro$breaks <- cut(choro$percentsciengmale,breaks = seq(30,55, by = 5), include.lowest = TRUE, 
                    labels = c("30%-35%","35%-40%","40%-45%",
                              "45%-50%","50%-55%"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Percentage of Men Degree Holders with a Major in Science/Engineering") + 
    scale_fill_brewer(name = "Degree Rates", palette = "Blues")

Taking the ratio of the percentage of women degree holders with science and engineering majors to the percentage of men degree holders with science and engineering majors, we can see from a single map, rather than two, how region affects the relative rates of science and engineering degrees for men and women. Darker states indicate higher ratios of women to men, but the ratio never reaches one. A general trend is that the west and east coasts have higher ratios than the rest of the country. Unlike the first two maps, which displayed very similar regional patterns, this ratio map distinctly displays how different parts of the country have different magnitudes of college major gender bias.

In [27]:
choro$breaks <- cut(choro$percentsciengfemale/choro$percentsciengmale,breaks = seq(0.35,0.85,by = 0.05), include.lowest = TRUE, 
                    labels = c("0.35-0.40","0.40-0.45","0.45-0.50","0.50-0.55",
                              "0.55-0.60","0.60-0.65","0.65-0.70","0.70-0.75","0.75-0.80","0.80-0.85"))
#choro$breaks = cut(choro$percentsciengfemale/choro$percentsciengmale, 6)
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Ratio of Percentages for Women to Men with Major in Science/Engineering") + 
    scale_fill_brewer(name = "Ratio of Percents",
                       palette = "Purples")
In [69]:
desired_columns <- c(3,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72)  #re-creating the original "subset" dataframe, unchanged by the choropleth merging
desired_rows <- seq(2,53) #all states
subset <- bachelors[desired_rows, desired_columns] 
colnames(subset) <- c("State","Total","TotalMale","TotalFemale",
                      "SciEngTotal", "SciEngMale", "SciEngFemale",
                      "SciEngRelatedTotal", "SciEngRelatedMale", "SciEngRelatedFemale",
                      "BusinessTotal","BusinessMale","BusinessFemale",
                      "EducationTotal","EducationMale","EducationFemale",
                     "HumanitiesTotal","HumanitiesMale","HumanitiesFemale")

subset$percentSciEngMale <- as.numeric(subset$SciEngMale)/as.numeric(subset$TotalMale)*100           #while this pair has no immediate, direct relationship
subset$percentSciEngFemale <- as.numeric(subset$SciEngFemale)/as.numeric(subset$TotalFemale)*100

subset$SciEngRatio <- subset$percentSciEngFemale/subset$percentSciEngMale   

head(subset)
  State Total TotalMale TotalFemale SciEngTotal SciEngMale SciEngFemale SciEngRelatedTotal SciEngRelatedMale SciEngRelatedFemale ... BusinessFemale EducationTotal EducationMale EducationFemale HumanitiesTotal HumanitiesMale HumanitiesFemale percentSciEngMale percentSciEngFemale SciEngRatio
2 Alabama 792876 366201 426675 232948 146493 86455 79424 20420 59004 ... 81581 135842 27350 108492 161912 70769 91143 40.00344 20.26249 0.5065188
3 Alaska 139416 67843 71573 50587 29875 20712 12854 3428 9426 ... 9891 20385 6191 14194 33095 15745 17350 44.03549 28.93829 0.6571582
4 Arizona 1257449 621477 635972 416610 262739 153871 119456 34583 84873 ... 114068 183272 43925 139347 266219 122406 143813 42.27654 24.19462 0.5722941
5 Arkansas 433381 198339 235042 126510 78847 47663 42437 9224 33213 ... 39993 83480 16393 67087 85657 38571 47086 39.75365 20.27850 0.5101041
6 California 8415690 4123037 4292653 3427565 2040615 1386950 674278 209270 465008 ... 726846 563581 131708 431873 2163345 881369 1281976 49.49301 32.30986 0.6528166
7 Colorado 1440776 705075 735701 553496 331019 222477 121176 36832 84344 ... 123617 147138 36462 110676 332827 138240 194587 46.94806 30.24014 0.6441191
In [74]:
sort(as.numeric(subset$SciEngMale), decreasing = TRUE)[1:10]
subset[which(subset$SciEngMale == "2040615"), "State"]
subset[which(subset$SciEngMale == "1074765"), "State"]
subset[which(subset$SciEngMale == "912146"), "State"]

sort(as.numeric(subset$SciEngFemale), decreasing = TRUE)[1:10]
subset[which(subset$SciEngFemale == "1386950"), "State"]
subset[which(subset$SciEngFemale == "711283"), "State"]
subset[which(subset$SciEngFemale == "645017"), "State"]
  1. 2040615
  2. 1074765
  3. 912146
  4. 797404
  5. 561926
  6. 528311
  7. 505464
  8. 497992
  9. 440726
  10. 419946
'California'
'Texas'
'New York'
  1. 1386950
  2. 711283
  3. 645017
  4. 486289
  5. 370482
  6. 357891
  7. 356819
  8. 346375
  9. 340110
  10. 295373
'California'
'New York'
'Texas'

By sorting the science and engineering degree counts in descending order, it is possible to identify which states the highest counts belong to using the "which" function. California has the most male and female science and engineering degree holders. Texas and New York have the next highest counts for men, and New York and Texas have the next highest counts for women.

In [75]:
sort(as.numeric(subset$percentSciEngMale), decreasing = TRUE)[1:10]
subset[which(subset$percentSciEngMale == "51.9724320009824"), "State"]
subset[which(subset$percentSciEngMale == "50.63696225976"), "State"]
subset[which(subset$percentSciEngMale == "49.9569814248343"), "State"]

sort(as.numeric(subset$percentSciEngFemale), decreasing = TRUE)[1:10]
subset[which(subset$percentSciEngFemale == "41.6149338278437"), "State"]
subset[which(subset$percentSciEngFemale == "33.1419517414361"), "State"]
subset[which(subset$percentSciEngFemale == "32.9437484466618"), "State"]
  1. 51.9724320009824
  2. 50.63696225976
  3. 49.9569814248343
  4. 49.7663620413637
  5. 49.4930072177378
  6. 47.6220113737173
  7. 47.3557411283058
  8. 47.1133741667282
  9. 46.9480551714357
  10. 46.0928226636069
'District of Columbia'
'Washington'
'Maryland'
  1. 41.6149338278437
  2. 33.1419517414361
  3. 32.9437484466618
  4. 32.3098559329161
  5. 32.3026458001363
  6. 31.5553384998919
  7. 30.3424863564093
  8. 30.2401383170609
  9. 29.9560656325652
  10. 29.6273939162456
'District of Columbia'
'Massachusetts'
'Virginia'

Washington, DC has the highest percentage of both male and female science and engineering degree holders. Washington (state) and Maryland have the next highest concentrations for men, while Massachusetts and Virginia have the next highest for women.
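
The lookups above match each printed value against a quoted string, which only works if R's conversion of the stored number to text reproduces those exact digits. A less fragile alternative is to sort the rows and read off the state names directly. This is a minimal sketch on a toy data frame named toy (a hypothetical name) built from the six percentSciEngMale values shown in the table output earlier, not the full 52-row table.

```r
# First six states and their percentSciEngMale values from the table output above
toy <- data.frame(
  State = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  percentSciEngMale = c(40.00344, 44.03549, 42.27654, 39.75365, 49.49301, 46.94806),
  stringsAsFactors = FALSE
)

# Row positions sorted from the highest to the lowest percentage
ranking <- order(toy$percentSciEngMale, decreasing = TRUE)

# Names of the three highest-percentage states, with no hard-coded values
top3 <- toy$State[ranking][1:3]
print(top3)
```

The same pattern applies to any column, such as the raw counts after conversion with "as.numeric".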

In [127]:
subset[which(subset$percentSciEngMale > 46), "State"]
subset[which(as.numeric(subset$SciEngMale) > 410000), "State"]

order(subset$percentSciEngMale)[42:52]
order(as.numeric(subset$SciEngMale))[42:52]

subset[which(subset$percentSciEngFemale > 29), "State"]
subset[which(as.numeric(subset$SciEngFemale) > 290000), "State"]

order(subset$percentSciEngFemale)[42:52]
order(as.numeric(subset$SciEngFemale))[42:52]
order(subset$SciEngRatio)[42:52]
  1. 'California'
  2. 'Colorado'
  3. 'District of Columbia'
  4. 'Maryland'
  5. 'Massachusetts'
  6. 'New Hampshire'
  7. 'Oregon'
  8. 'Virginia'
  9. 'Washington'
  10. 'Wyoming'
  1. 'California'
  2. 'Florida'
  3. 'Illinois'
  4. 'Massachusetts'
  5. 'New Jersey'
  6. 'New York'
  7. 'Pennsylvania'
  8. 'Texas'
  9. 'Virginia'
  10. 'Washington'
  1. 32
  2. 51
  3. 6
  4. 30
  5. 38
  6. 22
  7. 5
  8. 47
  9. 21
  10. 48
  11. 9
  1. 36
  2. 48
  3. 22
  4. 31
  5. 47
  6. 39
  7. 14
  8. 10
  9. 33
  10. 44
  11. 5
  1. 'California'
  2. 'Colorado'
  3. 'District of Columbia'
  4. 'Maryland'
  5. 'Massachusetts'
  6. 'New Jersey'
  7. 'Oregon'
  8. 'Vermont'
  9. 'Virginia'
  10. 'Washington'
  1. 'California'
  2. 'Florida'
  3. 'Illinois'
  4. 'Massachusetts'
  5. 'New Jersey'
  6. 'New York'
  7. 'North Carolina'
  8. 'Pennsylvania'
  9. 'Texas'
  10. 'Virginia'
  1. 2
  2. 31
  3. 46
  4. 6
  5. 38
  6. 48
  7. 21
  8. 5
  9. 47
  10. 22
  11. 9
  1. 21
  2. 34
  3. 22
  4. 39
  5. 31
  6. 47
  7. 14
  8. 10
  9. 44
  10. 33
  11. 5
  1. 6
  2. 21
  3. 5
  4. 46
  5. 2
  6. 47
  7. 31
  8. 8
  9. 33
  10. 22
  11. 9

Looking at the state names and their row positions through the "which" and "order" functions, the states with the highest concentrations of science and engineering majors for both men and women do not necessarily match the states with the highest degree counts. Likewise, the states with the highest female-to-male percentage ratios do not necessarily match the states with the highest female percentages. What these figures indicate is that although there appears to be a relatively constant 3:2 ratio of male to female science and engineering degree totals across the country, the percentages of men and women in the sciences at the state level are not nearly as consistent. Therefore, the level of gender disparity in the sciences depends on the region of the country. Note that the "which" commands list the matching states alphabetically, while the "order" command returns row positions sorted by the chosen variable in ascending order, which is why positions 42 to 52 are used to index the largest values.
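
The mismatch between the percentage lists and the count lists can be made explicit with "intersect". This is a minimal sketch that copies the four ten-state lists from the "which" output above; the variable names are illustrative.

```r
# Ten-state lists copied from the "which" output above
top_pct_male <- c("California", "Colorado", "District of Columbia", "Maryland",
                  "Massachusetts", "New Hampshire", "Oregon", "Virginia",
                  "Washington", "Wyoming")
top_count_male <- c("California", "Florida", "Illinois", "Massachusetts",
                    "New Jersey", "New York", "Pennsylvania", "Texas",
                    "Virginia", "Washington")
top_pct_female <- c("California", "Colorado", "District of Columbia", "Maryland",
                    "Massachusetts", "New Jersey", "Oregon", "Vermont",
                    "Virginia", "Washington")
top_count_female <- c("California", "Florida", "Illinois", "Massachusetts",
                      "New Jersey", "New York", "North Carolina", "Pennsylvania",
                      "Texas", "Virginia")

# States that appear in both the percentage list and the count list
both_male   <- intersect(top_pct_male, top_count_male)
both_female <- intersect(top_pct_female, top_count_female)
print(both_male)
print(both_female)
```

For each gender, only four of the ten states appear on both lists.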

In [30]:
summary(lm(log(as.numeric(subset$SciEngRatio)) ~ log(as.numeric(subset$SciEngTotal))))
Call:
lm(formula = log(as.numeric(subset$SciEngRatio)) ~ log(as.numeric(subset$SciEngTotal)))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.35757 -0.07210 -0.01002  0.07399  0.34999 

Coefficients:
                                    Estimate Std. Error t value    Pr(>|t|)    
(Intercept)                         -1.02118    0.18148  -5.627 0.000000827 ***
log(as.numeric(subset$SciEngTotal))  0.03825    0.01454   2.630      0.0113 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1118 on 50 degrees of freedom
Multiple R-squared:  0.1215,	Adjusted R-squared:  0.104 
F-statistic: 6.918 on 1 and 50 DF,  p-value: 0.01131

The regression above serves to establish that the ratio map is not strongly correlated with the total number of science and engineering degrees per state. This was an area of concern due to the seemingly similar appearance of the science and engineering degree count maps and the ratio map. If the ratio map were simply suggesting that states with more science and engineering degree holders had higher percentages of women in the sciences relative to the percentage of men, this might have indicated nothing more than that more degree holders equates to more equitable percentages. The log was taken of both the dependent variable, the science and engineering gender ratio, and the independent variable, the total number of science and engineering degrees per state, to estimate the percentage change in the ratio associated with a percentage change in the number of degrees. Otherwise, the difference in units would have produced coefficients that are difficult to interpret. The slope of just 0.03825, although statistically significant at the 5% level, together with an R-squared of about 0.12, indicates that there is only a weak relationship between the number of science and engineering degrees in a state and how much of a gender bias there is in the percentages of men and women majoring in the sciences and engineering. A one percent increase in the number of science and engineering degrees in a state is associated with only about a 0.04% increase in the ratio.
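
In a log-log regression the slope is an elasticity, so the effect of a given percentage change in degrees on the ratio can be computed directly from the coefficient. This is a minimal sketch using the slope printed in the output above; the helper name pct_change_ratio is an illustrative assumption.

```r
# Slope from the log-log regression output above
slope <- 0.03825

# Hypothetical helper: percent change in the ratio implied by a given
# percent change in the number of science and engineering degrees
pct_change_ratio <- function(pct_change_degrees) {
  ((1 + pct_change_degrees / 100)^slope - 1) * 100
}

one_pct  <- pct_change_ratio(1)     # effect of a 1% increase in degrees
doubling <- pct_change_ratio(100)   # effect of doubling the number of degrees
print(round(c(one_pct, doubling), 3))
```

Under this model a 1% increase in degrees corresponds to a change in the ratio of roughly 0.04%, and even doubling the number of degrees corresponds to a change of under 3%.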

There is not a lot of literature that analyzes region, college major choice, and gender together in the same context. However, given that preferences are a significant factor in why women choose not to major in fields like engineering, the next step in incorporating the findings of this presentation is bringing regional influence into the narrative by proposing that certain parts of the country create environments in which women form more positive impressions of the sciences. In his Federal Reserve Bank of New York Staff Report, Basit Zafar found through his econometric model mentioned earlier that "60% of the gender gap in engineering is due to differences in preferences, while 30% is due to differences in how much females and males believe they will enjoy studying engineering" (Zafar 4). There are other explanations offered for why students choose one major over another, including a strong emphasis on the connection between major choice and political ideology. One 2006 paper found that "liberal students [are] more likely to choose a non-science major" (Porter and Umbach 2006). This explanation seems to run counter to the finding that the typically more liberal coastal regions of the United States have high percentages of science degrees relative to total degrees for both men and women. However, the survey used for Porter and Umbach's paper covered only one highly selective liberal arts college, and the authors acknowledged that the results cannot be extrapolated to a larger sample of students from different types of schools. Similarly, the 2015 data examined in this paper is a single snapshot in time of the relationship between gender, major, and region. Further temporal analysis should be considered to determine the changing landscape of gender imbalance in science and engineering major selection.

The complexities of gender gap analysis go beyond data limitations. The chosen scope of the inspection inherently changes the range of possible interpretations. In a study that examined not only gender but also socioeconomic status (SES), Ma found "that women from lower SES backgrounds are as likely as their male counterparts to choose a lucrative college major" and "the role of lucrative college major choice in potentially uplifting students’ and their families’ SES outweighs the traditional gender role socialization that contributes to the divergent career paths toward which men and women are oriented" (Ma 228). In a paper on citizenship status, the author found "a higher propensity to enroll in SEM [Science, Engineering and Math] fields for foreign-born populations and a lower propensity to enroll in social sciences compared to citizens" (Nores 138). In order to completely disaggregate all of the possible effects on the gender gap in college major choice, all conceivable variables would have to be included in the analysis.

Although the spatial maps and ratios calculated suggest that different parts of the country experience different magnitudes of gender bias in college major choice, the ability to prove geographic cause is not within the scope of this presentation. However, if government policies, educational backgrounds, or cultural differences are attached to region, then the analysis conducted may be a starting point in identifying why varying shares of women across the country are systematically choosing not to go into the sciences or engineering during their undergraduate careers. Additionally, college students do not necessarily come from the same state they study in. State bias might then indicate quality discrepancies in academic institutions in particular states rather than any gender equality differences. Better schools might have more resources for scientific and engineering research. What the data of the US Census Bureau American Community Survey do show is that women are not entering the sciences and engineering at the same rate as men, and that there is at least an indirect relationship between region and the level of gender disparity.

Bibliography

Daymont, Thomas N., and Paul J. Andrisani. "Job Preferences, College Major, and the Gender Gap in Earnings." The Journal of Human Resources 19, no. 3 (1984): 408-28. doi:10.2307/145880.

Ma, Yingyi. "Family Socioeconomic Status, Parental Involvement, and College Major Choices—Gender, Race/Ethnic, and Nativity Patterns." Sociological Perspectives 52, no. 2 (2009): 211-34. doi:10.1525/sop.2009.52.2.211.

Nores, Milagros. "Differences in College Major Choice by Citizenship Status." The Annals of the American Academy of Political and Social Science 627 (2010): 125-41. http://www.jstor.org/stable/40607409.

Porter, Stephen R., and Paul D. Umbach. "College Major Choice: An Analysis of Person–Environment Fit." Research in Higher Education 47, no. 4 (2006): 429-49. doi:10.1007/s11162-005-9002-3.

United States Census Bureau. (2015). American Community Survey [bachelors.csv]. Retrieved from https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml

Zafar, Basit. "College Major Choice and the Gender Gap." SSRN Electronic Journal (2013): 1-50. doi:10.2139/ssrn.1348219.