Birth rates in South Korea have reached historically low levels,
posing significant social and economic challenges. Understanding the
factors that influence regional birth rates is essential for developing
effective policies to address this issue. This study explores the
complex relationships between socio-economic, educational, welfare, and
demographic variables affecting birth rates across different regions.
The primary objective of this study is to identify the key drivers of declining birth rates and uncover regional patterns using advanced statistical methods. By employing Principal Component Analysis (PCA), Factor Analysis, Cluster Analysis, and Multidimensional Scaling (MDS), this research aims to provide actionable insights for policymakers to design targeted interventions for increasing birth rates.
Variable Descriptions :
Dependent Variable
Independent Variables
Family Factors :
Social Factors :
Educational Factors :
Welfare Factors :
Economic Factors :
To understand the relationships between variables, correlation
coefficients were calculated for each domain: social, family, education,
welfare, and economy.
birth = read.csv('birth.csv')
head(birth)
data = birth[,-1]
#Social Domain
round(cor(birth[, c("Average_Age_at_First_Marriage_Female","Population_of_Marriageable_Age_Women","Population_Growth_Rate","EQ5D")]), 4)
#Family Domain
round(cor(birth[, c("Marriage_Rate","Divorce_Rate")]), 2)
#Education Domain
round(cor(birth[, c("Number_of_Kindergartens", "Number_of_Elementary Schools")]))
#Welfare Domain
round(cor(birth[, c("Number_of_Cultural_Facilities", "Number_of_Childcare_Facilities")]))
#Economy Domain
round(cor(birth[, c("Fiscal_Independence", "Fiscal_Autonomy")]))
#Correlation Matrix
corr_matrix <- round(cor(birth), 2)
print(corr_matrix)
Below is a summary of key results:
The highest correlation was observed between Average Age at First
Marriage Female and Population of Marriageable Age Women (correlation:
0.096). However, the correlation between Average Age at First Marriage
Female and EQ5D (Quality of Life Index) was negligible.
Crude Marriage Rate and Crude Divorce Rate showed a strong positive
correlation (0.96).
Number of Kindergartens and Number of Elementary Schools were
strongly correlated (0.95).
Number of Cultural Facilities and Number of Childcare Facilities
showed low correlation (-0.24).
Fiscal Independence and Fiscal Autonomy had a moderate positive
correlation (0.35).
Some variables (e.g., Crude Marriage Rate and Population of Marriageable Age Women) showed moderate correlations, while others (e.g., Number of Kindergartens and Number of Elementary Schools) displayed stronger relationships.
pairs(test, panel = function (x, y, ...) {
points(x, y, ...)
abline(lm(y ~ x), col = 'red')
}, cex = 1.5, pch = "*", bg = "blue",
diag.panel = panel.hist, cex.labels = 1, font.labels = 1)
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
[Fig. Scatterplot Matrix]
The scatterplot matrix reveals a strong linear relationship between
Total Fertility Rate and Average Age at First Marriage, as well as
between Number of Kindergartens and Number of Elementary Schools.
However, Cultural Facilities show no notable linear relationship
with other variables, suggesting they may not significantly contribute
to explaining fertility rates.
Additionally, the scatterplot
matrix highlights the presence of outliers, indicating the need for
further data preprocessing before re-evaluating the correlations and
patterns.
3) Data Preprocessing
To analyze the dataset effectively, the normality of each variable was assessed, followed by the identification of noticeable outliers.
layout(matrix(1:16, nc = 4))
sapply(colnames(data), function(x) {
qqnorm(data[[x]], main = x, )
qqline(data[[x]])
})
[Fig. Normality Check for Variables]
The Q-Q plots revealed that most variables, except for Childcare
Facilities, EQ5D, Average Age at First Marriage, Divorce Rate, Marriage
Rate, and Total Fertility Rate, deviated from normality.
Additionally, many variables were found to have outliers.
To detect potential outliers, chi-square plots were generated for the dataset.
x <- birth
cm <- colMeans(x)
S <- cov(x)
d <- apply(x, 1, function(x) {t(x-cm) %*% solve(S) %*% (x-cm)}) plot(qc <- qchisq((1:nrow(x) - 1/2) / nrow(x), df=7),
sd <- sort(d),
xlab = expression(paste(chi[7]^2, "Quantile")),
ylab = "Ordered distances", xlim = range(qc) * c(1, 1.1))
oups <- which(rank(abs(qc - sd), ties = "random") > nrow(x) - 5) text(qc[oups], sd[oups] - 1.5, names(oups))
abline(a=0, b=1)
[Fig. Chi-square Plot]
The chi-square plot revealed that the data for Hanam-si (Gyeonggi-do), Jeju Special Self-Governing Province, Yeongwol-gun (Gangwon-do), Haenam-gun (Jeollanam-do), and Jongno-gu (Seoul) deviated significantly from the expected distribution.
[Fig. Normality Check After Outlier Removal]
After removing outliers, Q-Q plots were generated for each
variable to assess normality.
4) Data Visualization
Scatterplots by Variable Categories
The full scatterplot
matrix provides an overview of pairwise relationships across all
variables.
Below, scatterplots are grouped by categories for
clearer insights.
family <- data[, c('Crude Marriage Rate', 'Crude Divorce Rate')]
pairs(family, diag.panel=panel.hist,
panel=function(x,y){
data<-data.frame(cbind(x,y))
par(new=TRUE)
den<-bkde2D(data,bandwidth=sapply(data,dpik))
contour(x=den$x1,y=den$x2,
z=den$fhat,axes=FALSE)
})
[Fig. Scatterplot for Family Factors]
The results indicate that the family factors exhibit a single peak, making it difficult to distinguish between regions.
education<-data[, c('Number of Kindergartens
','Number of Elementary Schools
')]
pairs(education, diag.panel=panel.hist,
panel=function(x,y){
data<-data.frame(cbind(x,y))
par(new=TRUE)
den<-bkde2D(data,bandwidth=sapply(data,dpik))
contour(x=den$x1,y=den$x2,
z=den$fhat,axes=FALSE)
})
[Fig. Scatterplot for Education Factors]
The results indicate that the education factors exhibit a single peak, making it difficult to differentiate between regions.
society<- data[, c('Marriageable Age Women
','Average Age at First Marriage','EQ5D','Population Growth Rate')] pairs(society, diag.panel=panel.hist,
panel=function(x,y){
data<-data.frame(cbind(x,y))
par(new=TRUE)
den<-bkde2D(data,bandwidth=sapply(data,dpik))
contour(x=den$x1,y=den$x2,
z=den$fhat,axes=FALSE)
})
[Fig. Scatterplot for Social Factors]
The results indicate that the bivariate probability density related to women of marriageable age in the social factors shows two peaks, while the other variables in the social domain exhibit a single peak, making it difficult to distinguish between regions.
welfare<-data[, c('Number of Childcare Facilities','Number of Cultural Facilities')]
pairs(welfare, diag.panel=panel.hist,
panel=function(x,y){
data<-data.frame(cbind(x,y))
par(new=TRUE)
den<-bkde2D(data,bandwidth=sapply(data,dpik))
contour(x=den$x1,y=den$x2,
z=den$fhat,axes=FALSE)
})
[Fig. Scatterplot for Welfare Factors]
The results indicate that the welfare factors exhibit multiple peaks, allowing regions to be classified into several groups.
economy<-data[, c('Fiscal Autonomy','Fiscal Independence')]
pairs(economy, diag.panel=panel.hist,
panel=function(x,y){
data<-data.frame(cbind(x,y))
par(new=TRUE)
den<-bkde2D(data,bandwidth=sapply(data,dpik))
contour(x=den$x1,y=den$x2,
z=den$fhat,axes=FALSE)
})
[Fig. Scatterplot for Economy Factors]
The results indicate that the economic factors exhibit multiple peaks, enabling regions to be classified into several groups.
birth = birth[-outcity,]
round(cor(birth),2)
birth_corr = cor(birth)
data_pcacor=princomp(covmat=birth_corr)
summary(data_pcacor, loading=TRUE)
birth_corr = cor(test)
data_pcacor=princomp(covmat=birth_corr)
summary(data_pcacor, loading=TRUE)
[Fig. Importance of Principal Components]
Principal Component Naming
[Fig. Principal Component Loadings by Variables]
Based on the results of the principal component analysis, the
components were named as follows:
library("MASS")
(data_mds3 = isoMDS(D))
x = data_mds3$points[, 1]
y = data_mds3$points[, 2]
lim = range(c(x, y)) * 1.2
plot(x, y, xlab = "Coordinate 1", ylab = "Coordinate 2",
, ylim = c(-50,50),type = "n")
text(x, y, rownames(test), cex = 0.7,col = c("blue","black")[data$city])
[Fig. Non-metric Multidimensional Scaling]
The results of the non-metric multidimensional scaling (Non-metric MDS) indicate a clear distinction between the Seoul-Incheon region and other regions.
Variables with uniqueness values greater than 0.5 were identified. These variables (Number of Childcare Facilities, Number of Cultural Facilities, Population Growth Rate, and EQ5D) were excluded from the analysis.
sapply(1:4, function(f)
factanal(scale(data) ,factors=f , method="mle")$PVAL)
(scores <- factanal(scale(data), factors = 4 method ="mle",scores = "regression")$scores)
factanal(scale(data), factors = 3, method ="mle")
birth_fit<-birth[,-c(1,7,8,13)]
birthfit_cor<-cor(test_fit)
sapply(1:4, function(f)
factanal(birthfit_cor ,factors=f , method="mle")$PVAL)
scores<-factanal(scale(test_fit),factors=4,method='mle',scores='regression')$scores factanal(covmat=testfit_cor,factors=4,method='mle',n.obs=221)
| Variable | Factor 1 | Factor 2 | Factor 3 | Factor 4 |
|---|---|---|---|---|
| Average Age at First Marriage Female | - | - | -0.131 | 0.799 |
| Population of Marriageable Age Women | 0.810 | 0.333 | - | 0.348 |
| Crude Marriage Rate | 0.191 | 0.938 | - | - |
| Number of Childcare Facilities | 0.985 | 0.148 | - | - |
| Number of Elementary Schools | 0.947 | 0.110 | - | - |
| Population Growth Rate | - | 0.345 | 0.328 | - |
| Crude Divorce Rate | 0.195 | 0.966 | - | 0.152 |
| Fiscal Independence | 0.454 | 0.485 | 0.533 | 0.402 |
| Fiscal Autonomy | - | - | 0.800 | -0.138 |
Based on the absolute factor loadings, the following groups
were identified:
birth = birth[, -1]
X <- scale(birth[, names(birth)])
dj <- dist(x)
plot(cc <- hclust(dj, method = "single"), main = "single")
plot(cc <- hclust(dj, method = "complete"), main = "complete")
plot(cc <- hclust(dj, method = "average"), main = "average")
Agglomerative Hierarchical Clustering was applied using
three methods:
Single Linkage, Complete Linkage, and Average
Linkage, to observe how the data is hierarchically classified.
The results showed that the Complete Linkage Method divided the data
into two major clusters.
To understand the criteria for these
clusters, the data will be analyzed by region.
pr <- prcomp(dj)$x[, 1:2]
plot(pr, pch=(1:2)[cutree(cs, k=2)], col = c("red", "black")[test$city], xlim = range(pr) * c(0.9, 1.5))
legend("topright", col = c("red", "red", "black", "black"), legend = c("1/no", "2/no", "1/yes", "2/yes"), pch = c(1:2, 1:2), title="Cluster/city", bty="n")
[Fig. Cluster Distribution by Region (PC1 vs. PC2)]
The results show the data projected onto the first two principal
component spaces.
The shape of the points represents the two
clusters (1 and 2), while the colors indicate the values of the city
variable (metropolitan and provincial areas vs. non-metropolitan areas).
The division along the first principal component is influenced by
the fact that it represents “Urban Development.”
However, this
distinction is not perfectly clear.
wssplot <- function(data, nc=15, seed=1234){
wss <- (nrow(data)-1)*sum(apply(data, 2, var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")}
wssplot(training.data)
[Fig. K-Means Clustering]
To perform K-Means Clustering, the number of groups must be specified. The “elbow” method is used by plotting the within-cluster sum of squares (WCSS) to determine the optimal number of clusters. In the graph, the “elbow” appears between 2 and 4 clusters. Therefore, K-Means Clustering will be conducted with 2 clusters.
round( sapply(birth, var), digits=1)
rge <- sapply(birth, function(x) diff(range(x)))
birth_s <- sweep(birth, 2, rge, FUN = "/")
round( sapply(birth_s, var), digits=5)
kmeans(birth_s, centers = 2)$centers * rge
birth_pca <- prcomp(birth_s)
plot(birth_pca$x[, 1:2], pch = kmeans(birth_s, centers = 2)$cluster)
[Fig. Scatter Plot of PC1 vs. PC2]
The plot above illustrates the two-group solution in the first two principal component spaces derived from the correlation matrix. Similar to the results of the Agglomerative Hierarchical Clustering, the two groups are well separated along the first principal component score, representing “Urban Development.”
Partitioning Around Medoids (PAM) is an alternative to K-Means Clustering, designed to address its sensitivity to outliers. Unlike K-Means, which represents clusters using centroids (mean vectors), PAM uses medoids (actual data points) to represent clusters. This allows PAM to work with different distance metrics, making it suitable for both continuous variables and mixed data types. Using this method, the data will be re-clustered.
install.packages("NbClust")
library(NbClust)
nc <- NbClust(training.data, min.nc=2, max.nc=15, method="kmeans") par(mfrow=c(1,1))
barplot(table(nc$Best.n[1,]),
xlab="Numer of Clusters", ylab="Number of Criteria",
main="Number of Clusters Chosen")
[Fig. Optimal Number of Clusters Determined by PAM]
To determine the optimal number of clusters more accurately, the NbClust package was used instead of relying solely on the elbow method.
The results suggest that 3 clusters are the most appropriate
solution.
install.packages("NbClust")
install.packages("caret")
#Standardization
data2 = data[, -14]
names(data2)
data3 <- scale(data2)
summary(data3)
#Model Building
fit.km=kmeans(data3, 3, nstart=25)
fit.km$cluster
fit.km$size
fit.km$centers
#Cluster Validation
fit$cluster <-as.factor(fit.km$cluster)
#Principal Component
dj <- dist(data3)
pr <- prcomp(dj)$x[, 1:2]
#Cluster
library(ggplot2)
set.seed(1234)
fit.pam <- pam(data[-1], k=3, stand=TRUE)
fit.pam$medoids
clusplot(fit.pam, main="Bivariate Cluster Plot")
fit <-pam(data3,k=3)
data3$clustering <- factor(fit$clustering)
ggplot(data=data3, aes(x=V1,y=V2,color=clustering, shape=clustering))+
geom_point() + ggtitle("Clustering of Bivariate Normal Data")
[Fig. Bivariate Cluster Plot]
The PAM results, visualized in the first two principal component
spaces, reveal three distinct clusters.
As shown in the figure, two clusters are primarily separated by the
first principal component (“Urban Development”), while the remaining
cluster is divided based on the second principal component (“Demographic
Factors”).
The interpretation of the ○ and + groups is consistent with the results from Agglomerative Hierarchical Clustering. The △ group, however, can be interpreted as regions with low “Demographic Factors” at the city and district levels.
This project analyzed key factors influencing birth rates at
the city and district levels using PCA, MDS, factor analysis, and
clustering.