Jason Greenberg

Research Question: Starting as early as their undergraduate careers, do women face barriers to entering STEM-related fields, specifically the sciences and engineering, or hold strong preferences against entering them, because of their gender? Also, does region within the United States affect this gender disparity?

In [1]:

```
library(ggplot2)
library(maps)
library(RColorBrewer)
```

The R packages loaded above are needed to draw the graphics that support the argument developed below.

In [2]:

```
bachelors <- read.csv("bachelors.csv", header = TRUE, stringsAsFactors = FALSE)
dim(bachelors)
head(bachelors)
```

In [3]:

```
desired_columns <- c(3, 4, 16, 28, 40, 52, 64)
desired_rows <- seq(2,53) #all states, Washington DC, and Puerto Rico
subsetTotal <- bachelors[desired_rows, desired_columns]
colnames(subsetTotal) <- c("State","Total", "SciEng", "SciEngRelated", "Business", "Education", "HumArts")
dim(subsetTotal)
subsetTotal
```

In [5]:

```
colnames(subsetTotal) <- c("State","Total", "SciEng", "SciEngRelated", "Business", "Education", "HumArts")
bandNames <- colnames(subsetTotal[,3:7])
par(mfrow = c(3,2))
par(mar = c(0,0,0,0))
for(j in 1:5){
hist(as.numeric(subsetTotal[,j+2])/as.numeric(subsetTotal[,2]),breaks = seq(0,0.7, by=0.05),ylim = c(0,50),
axes = FALSE, main = "", xlab = "", ylab = "", col = "grey")
box()
text(x = .33, y=40, label = bandNames[j])
}
```

In [6]:

```
Desired_columnsMale <- c(8, 20, 32, 44, 56, 68) #men totals
Desired_rowsMale <- seq(2,53) #all states
SubsetMale <- bachelors[Desired_rowsMale, Desired_columnsMale]
colnames(SubsetMale) <- c("TotalMale", "SciEngMale", "SciEngRelatedMale", "BusinessMale",
"EducationMale", "HumArtsMale")
bandNames <- colnames(SubsetMale[,-1])
par(mfrow = c(3,2))
par(mar = c(0,0,0,0))
for(j in 1:5){
hist(as.numeric(SubsetMale[,j+1])/as.numeric(SubsetMale[,1]),breaks = seq(0,0.7, by=0.05),ylim = c(0,50),
axes = FALSE, main = "", xlab = "", ylab = "", col = "grey")
box()
text(x = .33, y=40, label = bandNames[j])
}
Desired_columnsFemale <- c(12, 24, 36, 48, 60, 72) #women totals
Desired_rowsFemale <- seq(2,53) #all states
SubsetFemale <- bachelors[Desired_rowsFemale, Desired_columnsFemale]
colnames(SubsetFemale) <- c("TotalFemale", "SciEngFemale", "SciEngRelatedFemale", "BusinessFemale",
"EducationFemale", "HumArtsFemale")
bandNames <- colnames(SubsetFemale[,-1])
par(mfrow = c(3,2))
par(mar = c(0,0,0,0))
for(j in 1:5){
hist(as.numeric(SubsetFemale[,j+1])/as.numeric(SubsetFemale[,1]),breaks = seq(0,0.7, by=0.05),ylim = c(0,50),
axes = FALSE, main = "", xlab = "", ylab = "", col = "grey")
box()
text(x = .33, y=40, label = bandNames[j])
}
```

In [128]:

```
desired_columns <- c(3,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72)
desired_rows <- seq(2,53) #all states
subset <- bachelors[desired_rows, desired_columns]
colnames(subset) <- c("State","Total","TotalMale","TotalFemale",
"SciEngTotal", "SciEngMale", "SciEngFemale",
"SciEngRelatedTotal", "SciEngRelatedMale", "SciEngRelatedFemale",
"BusinessTotal","BusinessMale","BusinessFemale",
"EducationTotal","EducationMale","EducationFemale",
"HumanitiesTotal","HumanitiesMale","HumanitiesFemale")
subset$percentMaleTotal <- as.numeric(subset$TotalMale)/as.numeric(subset$Total)*100
subset$percentFemaleTotal <- as.numeric(subset$TotalFemale)/as.numeric(subset$Total)*100
subset$percentOfMaleInSciEng <- as.numeric(subset$SciEngMale)/as.numeric(subset$SciEngTotal)*100 #this pair of percentages will sum to 100%
subset$percentOfFemaleInSciEng <- as.numeric(subset$SciEngFemale)/as.numeric(subset$SciEngTotal)*100
subset$percentSciEngMale <- as.numeric(subset$SciEngMale)/as.numeric(subset$TotalMale)*100 #while this pair has no immediate, direct relationship
subset$percentSciEngFemale <- as.numeric(subset$SciEngFemale)/as.numeric(subset$TotalFemale)*100
subset$percentSciEngRelatedMale <- as.numeric(subset$SciEngRelatedMale)/as.numeric(subset$TotalMale)*100
subset$percentSciEngRelatedFemale <- as.numeric(subset$SciEngRelatedFemale)/as.numeric(subset$TotalFemale)*100
subset$percentBusinessMale <- as.numeric(subset$BusinessMale)/as.numeric(subset$TotalMale)*100
subset$percentBusinessFemale <- as.numeric(subset$BusinessFemale)/as.numeric(subset$TotalFemale)*100
subset$percentEducationMale <- as.numeric(subset$EducationMale)/as.numeric(subset$TotalMale)*100
subset$percentEducationFemale <- as.numeric(subset$EducationFemale)/as.numeric(subset$TotalFemale)*100
subset$percentHumanitiesMale <- as.numeric(subset$HumanitiesMale)/as.numeric(subset$TotalMale)*100
subset$percentHumanitiesFemale <- as.numeric(subset$HumanitiesFemale)/as.numeric(subset$TotalFemale)*100
subset$SciEngRatio <- subset$percentSciEngFemale/subset$percentSciEngMale
head(subset)
```
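The two kinds of percentage computed above answer different questions, and a small worked example may make the distinction easier to see. The sketch below uses two made-up states (the names and counts are hypothetical, not values from bachelors.csv) and converts the character columns to numeric once instead of wrapping every use in `as.numeric()`.

```r
# Hypothetical counts for two made-up states; not taken from bachelors.csv.
toy <- data.frame(
  State        = c("A", "B"),
  TotalMale    = c("1000", "2000"),
  TotalFemale  = c("1200", "1800"),
  SciEngMale   = c("400", "900"),
  SciEngFemale = c("300", "450"),
  stringsAsFactors = FALSE
)

# Convert every count column to numeric in one step.
count_cols <- setdiff(names(toy), "State")
toy[count_cols] <- lapply(toy[count_cols], as.numeric)
toy$SciEngTotal <- toy$SciEngMale + toy$SciEngFemale

# Composition of the field: these two shares sum to 100 in each state.
toy$percentOfMaleInSciEng   <- toy$SciEngMale   / toy$SciEngTotal * 100
toy$percentOfFemaleInSciEng <- toy$SciEngFemale / toy$SciEngTotal * 100

# Rate within each gender: these two are not tied to each other.
toy$percentSciEngMale   <- toy$SciEngMale   / toy$TotalMale   * 100
toy$percentSciEngFemale <- toy$SciEngFemale / toy$TotalFemale * 100

# Ratio of the female rate to the male rate; 1 would mean equal rates.
toy$SciEngRatio <- toy$percentSciEngFemale / toy$percentSciEngMale
toy
```

In made-up state A, 40% of men and 25% of women hold a science or engineering degree, so the ratio is 0.625. The within-gender rates are the ones used for the ratio because they are not distorted by how many men and women hold degrees overall.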

In [8]:

```
median(as.numeric(subset$TotalMale))
median(as.numeric(subset$TotalFemale))
median(as.numeric(subset$SciEngMale))
median(as.numeric(subset$SciEngFemale))
```

In [9]:

```
sum(as.numeric(subset$TotalMale))
sum(as.numeric(subset$TotalFemale))
sum(as.numeric(subset$SciEngMale))
sum(as.numeric(subset$SciEngFemale))
```

In [10]:

```
median(subset$percentMaleTotal)
median(subset$percentFemaleTotal)
median(subset$percentOfMaleInSciEng )
median(subset$percentOfFemaleInSciEng)
median(subset$percentSciEngMale)
median(subset$percentSciEngFemale)
median(subset$percentSciEngMale)/median(subset$percentSciEngFemale)
median(subset$percentSciEngRelatedMale)/median(subset$percentSciEngRelatedFemale)
median(subset$percentBusinessMale)/median(subset$percentBusinessFemale)
median(subset$percentEducationMale)/median(subset$percentEducationFemale)
median(subset$percentHumanitiesMale)/median(subset$percentHumanitiesFemale)
```

In [12]:

```
options(scipen=999) #a large scipen penalty makes R prefer fixed notation over scientific notation when printing
summary(lm(as.numeric(subset$SciEngFemale) ~ as.numeric(subset$SciEngMale)))
```
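`scipen` is a penalty rather than a size threshold: R prints a number in fixed notation unless doing so would be more than `scipen` characters wider than scientific notation. A minimal sketch of the effect, using the arbitrary value 1e10 purely for illustration:

```r
# Remember the current setting so it can be restored afterwards.
old <- options(scipen = 0)

# With no penalty, the shorter scientific form is chosen.
sci <- format(1e10)

# With a large penalty, the fixed form is chosen even though it is wider.
options(scipen = 100)
fixed <- format(1e10)

# Restore the previous setting.
options(old)

c(scientific = sci, fixed = fixed)
```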

In [13]:

```
summary(lm(as.numeric(subset$SciEngRelatedFemale) ~ as.numeric(subset$SciEngRelatedMale)))
```

In [14]:

```
summary(lm(as.numeric(subset$BusinessFemale) ~ as.numeric(subset$BusinessMale)))
```

In [15]:

```
summary(lm(as.numeric(subset$EducationFemale) ~ as.numeric(subset$EducationMale)))
```

In [16]:

```
summary(lm(as.numeric(subset$HumanitiesFemale) ~ as.numeric(subset$HumanitiesMale)))
```
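The five regressions above share one pattern: female counts regressed on male counts for a single field. As a sketch of how the slopes and R-squared values could be collected into one table for comparison, the example below loops over field names on a small made-up data frame (the counts are hypothetical and only two fields are included).

```r
# Hypothetical counts for four made-up states; not taken from bachelors.csv.
toy <- data.frame(
  SciEngMale     = c(100, 200, 300, 400),
  SciEngFemale   = c(70, 130, 210, 270),
  BusinessMale   = c(80, 150, 240, 310),
  BusinessFemale = c(75, 160, 230, 320)
)

fields <- c("SciEng", "Business")

# Fit Female ~ Male for each field and keep only the slope and R-squared.
fits <- t(sapply(fields, function(f) {
  model <- reformulate(paste0(f, "Male"), response = paste0(f, "Female"))
  fit <- lm(model, data = toy)
  c(slope = unname(coef(fit)[2]), r_squared = summary(fit)$r.squared)
}))
fits
```

A slope below 1 means that each additional male degree holder in a state is associated with fewer than one additional female degree holder in the same field.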

In [17]:

```
ggplot(subset, aes(as.numeric(SciEngMale), as.numeric(SciEngFemale))) +
geom_point()+
#scale_x_continuous(name="Total Male SciEng Degree Holders", limits=c(0, 150000)) +
#scale_y_continuous(name="Total Female SciEng Degree Holders", limits=c(0, 150000))+
labs(x= "Total Male SciEng Degree Holders by State") +
labs(y = "Total Female SciEng Degree Holders by State")+ ylim(0,2000000)+
labs(title= "Relationship Between Male and Female SciEng Degree Holders") +
stat_smooth(method = lm, se = FALSE, color = "black") +
geom_vline(xintercept = 134349, linetype="dotted", colour="red")+
geom_hline(yintercept = 84527, linetype="dotted", colour="red")+
geom_vline(xintercept = 0)+
geom_hline(yintercept = 0)+
annotate("text", label = "r^2 == 0.9895", parse = TRUE,x= 1400000, y = 1500000) +
annotate("text", label = "slope = 0.676634", x= 1475000, y = 1250000)
#qplot(as.numeric(SciEngMale), as.numeric(SciEngFemale), data = subset, color = I("darkblue"),
# xlab = "Total Male SciEng Degree Holders", ylab = "Total Female SciEng Degree Holders",
# main = "Relationship Between Total Male and Female SciEng Degree Holders") + geom_smooth(method = "lm", se = FALSE)
#qplot version of the above ggplot
```

In [18]:

```
summary(lm(subset$percentSciEngFemale ~ subset$percentSciEngMale))
ggplot(subset, aes(percentSciEngMale, percentSciEngFemale)) +
geom_point()+
labs(x= "Percentage of Male SciEng Degree Holders by State") + xlim(32.5,52.5)+
labs(y = "Percentage of Female SciEng Degree Holders by State")+ ylim(15.25,35.25)+
labs(title= "Relationship Between Male and Female SciEng Degree Holders by Percentage") +
stat_smooth(method = lm, se = FALSE, color = "black")
```

In [19]:

```
states <- map_data("state")
head(states)
dim(states)
head(subset)
```

In [20]:

```
names(subset) <- tolower(names(subset))
subset$region <- tolower(subset$state)
head(subset)
```

In [21]:

```
choro_df <- merge(states, subset, by = "region") #merge(df1,df2,by="column vector")
head(choro_df)
```

Once both data frames share a matching "region" column, they can be combined with the "merge" command and then sorted by the "order" column so that each state's outline is drawn with its vertices in sequence.
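A minimal sketch of the merge-then-order step, using two made-up regions with three vertices each (all names and values here are hypothetical, and base R alone is enough to run it):

```r
# Hypothetical polygon outlines for two made-up regions (three vertices each).
outline <- data.frame(
  long   = c(0, 1, 1, 2, 3, 3),
  lat    = c(0, 0, 1, 0, 0, 1),
  group  = c(1, 1, 1, 2, 2, 2),
  order  = 1:6,
  region = rep(c("alpha", "beta"), each = 3),
  stringsAsFactors = FALSE
)

# Hypothetical per-region statistics, keyed with capitalised names.
state_stats <- data.frame(
  State = c("Beta", "Alpha"),
  rate  = c(30, 20),
  stringsAsFactors = FALSE
)

# Build a lower-case key so that both data frames share a "region" column.
state_stats$region <- tolower(state_stats$State)

# merge() does not promise to keep the vertex sequence...
merged <- merge(outline, state_stats, by = "region")

# ...so restore it with the "order" column before drawing polygons.
merged <- merged[order(merged$order), ]
merged
```

By default `merge` keeps only rows whose key appears in both data frames, so any region present in one data frame but missing from the other is dropped from the result.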

The next two maps, titled "Women Degree Holders with a Major in Science/Engineering" and "Men Degree Holders with a Major in Science/Engineering", display the raw counts behind the regression fitted earlier. States with more men holding science and engineering degrees also have more women holding them. The legends use different scales (bins of 100,000 degrees for women and 200,000 for men), which reflects the gap between the two genders in number of degrees. Because the two counts are so strongly related, the maps look very similar from region to region.

In [22]:

```
choro <- choro_df[order(choro_df$order),] #order by "order" column
head(choro)
```

In [23]:

```
choro$breaks <- cut(as.numeric(choro$sciengfemale),breaks = seq(0,1400000, by = 100000), include.lowest = TRUE,
labels = c("0-100,000","100,001-200,000","200,001-300,000","300,001-400,000","400,001-500,000",
"500,001-600,000","600,001-700,000","700,001-800,000","800,001-900,000","900,001-1,000,000",
"1,000,001-1,100,000","1,100,001-1,200,000","1,200,001-1,300,000","1,300,001-1,400,000"))
#choro$breaks <- cut(as.numeric(choro$sciengfemale),breaks = seq(0,1500000, by = 250000), include.lowest = TRUE,
# labels = c("0-250,000","250,000-500,000","250,000-500,000",
# "500,000-750,000","750,000-1,000,000","1,000,000-1,250,000"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon",
main = "Women Degree Holders with a Major in Science/Engineering") +
scale_fill_brewer(name = "Number of Degrees", palette = "Reds")
```

In [129]:

```
choro$breaks <- cut(as.numeric(choro$sciengmale),breaks = seq(0,2200000, by = 200000), include.lowest = TRUE,
labels = c("0-200,000","200,001-400,000","400,001-600,000","600,001-800,000","800,001-1,000,000",
"1,000,001-1,200,000","1,200,001-1,400,000","1,400,001-1,600,000","1,600,001-1,800,000","1,800,001-2,000,000",
"2,000,001-2,200,000"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon",
main = "Men Degree Holders with a Major in Science/Engineering") +
scale_fill_brewer(name = "Number of Degrees", palette = "Blues")
```

In [25]:

```
choro$breaks <- cut(choro$percentsciengfemale,breaks = seq(15,45, by = 5), include.lowest = TRUE,
labels = c("15%-20%","20%-25%","25%-30%",
"30%-35%","35%-40%","40%-45%"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon",
main = "Percentage of Women Degree Holders with a Major in Science/Engineering") +
scale_fill_brewer(name = "Degree Rates", palette = "Reds")
```

In [26]:

```
choro$breaks <- cut(choro$percentsciengmale,breaks = seq(30,55, by = 5), include.lowest = TRUE,
labels = c("30%-35%","35%-40%","40%-45%",
"45%-50%","50%-55%"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon",
main = "Percentage of Men Degree Holders with a Major in Science/Engineering") +
scale_fill_brewer(name = "Degree Rates", palette = "Blues")
```

In [27]:

```
choro$breaks <- cut(choro$percentsciengfemale/choro$percentsciengmale,breaks = seq(0.35,0.85,by = 0.05), include.lowest = TRUE,
labels = c("0.35-0.40","0.40-0.45","0.45-0.50","0.50-0.55",
"0.55-0.60","0.60-0.65","0.65-0.70","0.70-0.75","0.75-0.80","0.80-0.85"))
#choro$breaks = cut(choro$percentsciengfemale/choro$percentsciengmale, 6)
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon",
main = "Ratio of Percentages for Women to Men with Major in Science/Engineering") +
scale_fill_brewer(name = "Ratio of Percents",
palette = "Purples")
```
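The legend labels in the map cells above are typed by hand, so they have to be kept in step with the break points whenever the bins change. As a sketch of one alternative, the labels can be generated from the break points themselves; the ratios below are made up for illustration.

```r
# Hypothetical ratios for five made-up states.
ratio <- c(0.37, 0.52, 0.58, 0.71, 0.84)

cuts <- seq(0.35, 0.85, by = 0.05)

# Build one label per bin from each pair of consecutive break points.
bin_labels <- paste(sprintf("%.2f", head(cuts, -1)),
                    sprintf("%.2f", tail(cuts, -1)),
                    sep = "-")

# Assign each ratio to its bin; include.lowest keeps a value of exactly 0.35.
bins <- cut(ratio, breaks = cuts, labels = bin_labels, include.lowest = TRUE)
as.character(bins)
```

There is always one label fewer than there are break points, which is why the first and last break are dropped in turn when pairing them up.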

In [69]:

```
desired_columns <- c(3,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72) #re-creating the original "subset" data frame as it was before the choropleth preparation lower-cased its column names
desired_rows <- seq(2,53) #all states
subset <- bachelors[desired_rows, desired_columns]
colnames(subset) <- c("State","Total","TotalMale","TotalFemale",
"SciEngTotal", "SciEngMale", "SciEngFemale",
"SciEngRelatedTotal", "SciEngRelatedMale", "SciEngRelatedFemale",
"BusinessTotal","BusinessMale","BusinessFemale",
"EducationTotal","EducationMale","EducationFemale",
"HumanitiesTotal","HumanitiesMale","HumanitiesFemale")
subset$percentSciEngMale <- as.numeric(subset$SciEngMale)/as.numeric(subset$TotalMale)*100 #percentage of each gender's degree holders who majored in SciEng
subset$percentSciEngFemale <- as.numeric(subset$SciEngFemale)/as.numeric(subset$TotalFemale)*100
subset$SciEngRatio <- subset$percentSciEngFemale/subset$percentSciEngMale
head(subset)
```

In [74]:

```
sort(as.numeric(subset$SciEngMale), decreasing = TRUE)[1:10]
subset[which(subset$SciEngMale == "2040615"), "State"]
subset[which(subset$SciEngMale == "1074765"), "State"]
subset[which(subset$SciEngMale == "912146"), "State"]
sort(as.numeric(subset$SciEngFemale), decreasing = TRUE)[1:10]
subset[which(subset$SciEngFemale == "1386950"), "State"]
subset[which(subset$SciEngFemale == "711283"), "State"]
subset[which(subset$SciEngFemale == "645017"), "State"]
```

In [75]:

```
sort(as.numeric(subset$percentSciEngMale), decreasing = TRUE)[1:10]
subset[order(subset$percentSciEngMale, decreasing = TRUE)[1:3], "State"] #top three states, found by rank rather than by matching printed decimals
sort(as.numeric(subset$percentSciEngFemale), decreasing = TRUE)[1:10]
subset[order(subset$percentSciEngFemale, decreasing = TRUE)[1:3], "State"] #top three states, found by rank rather than by matching printed decimals
```

In [127]:

```
subset[which(subset$percentSciEngMale > 46), "State"]
subset[which(as.numeric(subset$SciEngMale) > 410000), "State"]
order(subset$percentSciEngMale)[42:52]
order(as.numeric(subset$SciEngMale))[42:52]
subset[which(subset$percentSciEngFemale > 29), "State"]
subset[which(as.numeric(subset$SciEngFemale) > 290000), "State"]
order(subset$percentSciEngFemale)[42:52]
order(as.numeric(subset$SciEngFemale))[42:52]
order(subset$SciEngRatio)[42:52]
```
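Note that `order()` returns row positions rather than state names, so the bare calls above print indices that still have to be looked up. A small sketch of turning those positions into names, on made-up rates for four hypothetical states:

```r
# Hypothetical rates for four made-up states; not taken from bachelors.csv.
toy <- data.frame(
  State             = c("A", "B", "C", "D"),
  percentSciEngMale = c(41.5, 52.0, 38.2, 47.3),
  stringsAsFactors  = FALSE
)

# order() gives row positions from the lowest rate to the highest.
positions <- order(toy$percentSciEngMale)

# Indexing the State column by those positions turns them into names;
# tail() then keeps the highest-ranked entries.
top_two <- tail(toy$State[positions], 2)
top_two
```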

In [30]:

```
summary(lm(log(as.numeric(subset$SciEngRatio)) ~ log(as.numeric(subset$SciEngTotal))))
```
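Because both sides of the regression above are logged, its slope reads as an elasticity: the approximate percentage change in the ratio associated with a one percent change in total science and engineering degrees. The sketch below illustrates this on made-up, noise-free data that follows y = 2 * x^0.5 exactly, so the fitted slope should recover the exponent.

```r
# Hypothetical, noise-free data that follows y = 2 * x^0.5 exactly.
x <- c(1, 4, 9, 16, 25)
y <- 2 * x^0.5

# Taking logs turns the power law into a straight line:
# log(y) = log(2) + 0.5 * log(x)
fit <- lm(log(y) ~ log(x))

slope     <- unname(coef(fit)[2])       # the exponent of the power law
multiplier <- unname(exp(coef(fit)[1])) # the constant in front of x
c(slope = slope, multiplier = multiplier)
```

A slope near zero in the real regression would suggest that the female-to-male ratio does not change systematically with the number of science and engineering degree holders in a state.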

The complexities of gender gap analysis go beyond data limitations. The chosen scope of the inspection inherently changes the range of possible interpretations. In a study that examined not only gender but also socioeconomic status (SES), Ma found "that women from lower SES backgrounds are as likely as their male counterparts to choose a lucrative college major" and "the role of lucrative college major choice in potentially uplifting students’ and their families’ SES outweighs the traditional gender role socialization that contributes to the divergent career paths toward which men and women are oriented" (Ma 228). In a paper on citizenship status, Nores found "a higher propensity to enroll in SEM [Science, Engineering and Math] fields for foreign-born populations and a lower propensity to enroll in social sciences compared to citizens" (Nores 138). Fully disaggregating the effects on the gender gap in college major choice would require including every conceivable variable in the analysis.

Although the maps and ratios suggest that the gender gap in college major choice differs in magnitude across the country, proving a geographic cause is beyond the scope of this presentation. However, if government policies, educational backgrounds, or cultural differences vary by region, then this analysis may be a starting point for identifying why women in some parts of the country are less likely than women elsewhere to choose the sciences or engineering as undergraduates. Additionally, college students do not necessarily come from the state in which they study. Differences between states might then reflect differences in the quality of their academic institutions rather than differences in gender equality; better schools, for example, might have more resources for scientific and engineering research. What the US Census Bureau American Community Survey data do show is that women are not entering the sciences and engineering at the same rate as men, and that there is at least an indirect relationship between region and the level of gender disparity.

Bibliography

Daymont, Thomas N., and Paul J. Andrisani. "Job Preferences, College Major, and the Gender Gap in Earnings." The Journal of Human Resources 19, no. 3 (1984): 408-28. doi:10.2307/145880.

Ma, Yingyi. "Family Socioeconomic Status, Parental Involvement, and College Major Choices—Gender, Race/Ethnic, and Nativity Patterns." Sociological Perspectives 52, no. 2 (2009): 211-34. doi:10.1525/sop.2009.52.2.211.

Nores, Milagros. "Differences in College Major Choice by Citizenship Status." The Annals of the American Academy of Political and Social Science 627 (2010): 125-41. http://www.jstor.org/stable/40607409.

Porter, Stephen R., and Paul D. Umbach. "College Major Choice: An Analysis of Person–Environment Fit." Research in Higher Education 47, no. 4 (2006): 429-49. doi:10.1007/s11162-005-9002-3.

United States Census Bureau. "American Community Survey" [bachelors.csv]. 2015. https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml.

Zafar, Basit. "College Major Choice and the Gender Gap." SSRN Electronic Journal (2013): 1-50. doi:10.2139/ssrn.1348219.