The H-1B: An analysis of American companies’ requests for external labour

Published in codeburst · 9 min read · Apr 30, 2018

The H-1B is a United States visa permitting employers in the United States to employ external labour in specialty occupations. Details about this class of visa can be read here.

While this is a popular topic and of interest to a good number of people, this article applies data analysis skills to extract information from the available H-1B data. The data is publicly available for past periods, so we analyse the data for the 2017/2018 period.

Description of the data

Data: The data set, as obtained from the Office of Foreign Labor Certification (OFLC), covers the period from October 1, 2017, through March 31, 2018. It covers different cases of Labor Condition Applications (LCAs), which employers must file with the United States Department of Labor Employment and Training Administration (ETA) on behalf of employees for a non-immigrant H-1B. Further description of the variables in the data set can be viewed here.

Focus: To analyze the data set for information that might be of interest, such as the industries and nationalities with the greatest concentration of applications. This may point to where most of the workforce needs are and to the nations that are moving to fill those needs.

Exploratory data analysis

The data was converted from XLSX to CSV format because of memory issues, as CSV files are much easier to handle. The data has 410,605 observations of 52 variables. We pick certain variables and look at how they are distributed in the data.

library(plyr)       # loaded before dplyr so that dplyr's verbs (summarize, mutate, ...) are not masked
library(dplyr)
library(tidyverse)
library(caret)
library(lubridate)
library(gridExtra)

h1b_data <- read.csv('H-1B_FY2018.csv', header = TRUE)
h1b_data <- as_tibble(h1b_data)
h1b_data$CASE_SUBMITTED <- as.Date(as.character(h1b_data$CASE_SUBMITTED), format = "%d/%m/%Y")

# number of applications submitted in 2017 & 2018
submitted_in_period <- year(h1b_data$CASE_SUBMITTED) %in% c(2017, 2018)
sum(submitted_in_period)

# keep only those applications
h1b_data <- h1b_data[submitted_in_period, ]

With the necessary libraries loaded in R and the data converted to a tibble for more readable printing, we see that 402,337 of the applications were submitted in the 2017/2018 period. The other 8,268 were submitted in earlier years but had a decision made on them within the 2017/2018 period. We consider only the 402,337 applications submitted in 2017/2018.
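
The split between in-period and earlier submissions can be illustrated with a small sketch. The dates below are made up for illustration, and base R's format() stands in for lubridate's year(), so the snippet runs without extra packages.

```r
# Toy submission dates (invented for illustration), in day/month/year form
case_submitted <- c("03/10/2017", "15/01/2018", "28/09/2016", "02/12/2017")
dates <- as.Date(case_submitted, format = "%d/%m/%Y")

# Extract the year and flag submissions made in 2017 or 2018
submitted_year <- as.integer(format(dates, "%Y"))
in_period <- submitted_year %in% c(2017, 2018)

sum(in_period)    # submissions inside the period
sum(!in_period)   # earlier submissions
```

Using %in% rather than == also treats a date that failed to parse (NA) as outside the period, instead of producing an NA row.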

The variables of interest are CASE_STATUS, CASE_SUBMITTED, PW_WAGE_LEVEL, SOC_CODE, EMPLOYER_NAME, EMPLOYER_CITY, EMPLOYER_STATE, AGENT_REPRESENTING_EMPLOYER, AGENT_ATTORNEY_NAME, AGENT_ATTORNEY_STATE, FULL_TIME_POSITION, WILLFUL_VIOLATOR and H-1B_DEPENDENT (which read.csv renames to H.1B_DEPENDENT), and these are selected out of the data.

# PW_WAGE_LEVEL is listed once; the repeated entry was redundant
h1b_data <- select(h1b_data, 'CASE_STATUS', 'CASE_SUBMITTED', 'PW_WAGE_LEVEL', 'SOC_CODE', 'EMPLOYER_NAME', 'EMPLOYER_CITY',
'EMPLOYER_STATE', 'AGENT_REPRESENTING_EMPLOYER', 'AGENT_ATTORNEY_NAME','AGENT_ATTORNEY_STATE','FULL_TIME_POSITION',
'WILLFUL_VIOLATOR','H.1B_DEPENDENT')
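
The column is called H-1B_DEPENDENT in the source file but H.1B_DEPENDENT in the code because read.csv, by default, rewrites column names into syntactically valid R names. A minimal sketch of that behaviour, using a tiny inline CSV with invented values:

```r
# A two-column CSV, inlined as text (values invented for illustration)
csv_text <- "H-1B_DEPENDENT,CASE_STATUS\nY,CERTIFIED\nN,DENIED"

# Default behaviour: invalid characters such as "-" become "."
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the header exactly as written in the file
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

default_names
raw_names
```

Keeping the default is convenient here, since a name containing a hyphen would need backticks every time it is used in code.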

At this point, there are now 402,337 observations of 13 variables.

The case status of any application can take on 4 values, which are:

Certified: A certified Labor Condition Application (LCA) is a prerequisite to H-1B approval. So, “certified” means the employer filed the LCA, it was approved by the Department of Labor (DOL), and the necessary prerequisite for an H-1B approval is in place.

Certified-Withdrawn: This means the LCA was approved but was later on withdrawn by the employer, for some reason. It could be that the employee worked for some years before the contract was terminated.

Denied: Means that the LCA was denied and so, the necessary prerequisite for an H-1B approval is not in place.

Withdrawn: Means that the LCA was withdrawn before approval or denial. So, no decision was taken before the employer withdrew the application.

Regrouping

Some re-grouping was done based on the SOC_CODE, which is the occupational code associated with the job being requested. All observations under the “15-” category were classified as “COMPUTING, STATISTICIANS”, as most of these roles had to do with software development, data science, machine learning and other roles requiring core computer programming skills. Jobs requiring finance, accounting and economics skills under the “13-” category were grouped as “FINANCIALS, COMPLIANCE”, and those requiring customer care, attendant and receptionist skills under the “39-” category were tagged “RECEIPTIONISTS, SERVICE ATTENDANTS”. Two observations falling in the “40-” and “71-” categories were placed with the “17-” category, tagged “ENGINEERING EXCEPT COMPUTERS”. After re-grouping, 16 groups emerged, fewer than in the initial grouping. The function used for the re-grouping is shown below.

# function for categorisation
getJobCategory <- function(soc_code)
{
  category <- sapply(soc_code, function(x)
  {
    switch(x,
           '11' = 'MANAGERIAL, ADMIN',
           '13' = 'FINANCIALS, COMPLIANCE',
           '15' = 'COMPUTING, STATISTICIANS',
           '17' = 'ENGINEERING EXCEPT COMPUTERS',
           '19' = 'SCIENTISTS',
           '21' = 'PSYCHOLOGY, COUNSELLING, SOCIAL WORKS',
           '23' = 'LEGAL',
           '25' = 'EDUCATORS, CURATORS',
           '27' = 'DESIGNERS, COACHES',
           '29' = 'MEDICALS',
           '31' = 'HEALTHCARE ASSTS',
           '33' = 'SECURITY',
           '35' = 'CULINARY',
           '37' = 'CLEANING, KEEPING',
           '39' = 'RECEIPTIONISTS, SERVICE ATTENDANTS',
           '40' = 'ENGINEERING EXCEPT COMPUTERS',
           '41' = 'TRADERS, SALES REPS',
           '43' = 'QUALITY, STATISTICAL ASSTS',
           '45' = 'AGRICULTURAL',
           '47' = 'ARTISANS',
           '49' = 'SERVICE TECHNICIANS',
           '51' = 'MACHINISTS',
           '53' = 'TRANSPORT',
           '71' = 'ENGINEERING EXCEPT COMPUTERS',
           as.character(x)
    )
    
  }, USE.NAMES = FALSE)  # drop names so a plain character vector is returned
  category
}

# regroup to a wider classification, based on SOC_CODE
h1b_data <- h1b_data[grepl('^[0-9]{2}-', h1b_data$SOC_CODE),]
h1b_data$SOC_CODE <- substr(h1b_data$SOC_CODE, start = 1, stop = 2)
h1b_data <- h1b_data %>%
            mutate(JOB_CATEGORY = getJobCategory(SOC_CODE))

After regrouping and keeping only the observations with the right SOC_CODE format, the data comes to 401,735 observations of 14 variables.
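
As an alternative to calling switch once per row, the same mapping can be sketched as a named-vector lookup, which is vectorised. This is only an illustration, not the method used for the results in this article: the SOC codes below are invented sample values, and only three of the labels are reproduced.

```r
# A few of the labels from the mapping above; the remaining codes would be added the same way
job_labels <- c(
  '13' = 'FINANCIALS, COMPLIANCE',
  '15' = 'COMPUTING, STATISTICIANS',
  '17' = 'ENGINEERING EXCEPT COMPUTERS'
)

# Sample SOC codes, invented for illustration; 'N/A' lacks the expected format
soc_code <- c('15-1132', '13-2011', 'N/A', '17-2051', '99-0000')

# Keep well-formed codes and reduce them to their two-digit prefix
prefix <- substr(soc_code[grepl('^[0-9]{2}-', soc_code)], 1, 2)

# Vectorised lookup; unmapped prefixes fall back to the prefix itself,
# mirroring the default branch of the switch
job_category <- unname(job_labels[prefix])
job_category[is.na(job_category)] <- prefix[is.na(job_category)]
job_category
```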

Taking a look at the following plots:

Case Statuses

# after_stat(count) replaces the older ..count.. notation (ggplot2 3.3.0 and later)
ggplot(h1b_data, mapping = aes(x = CASE_STATUS, fill = CASE_STATUS)) +
  geom_bar(aes(y = after_stat(count) / 1000)) +
  labs(title = 'Case Status Proportions', x = 'Case Status', y =  'Count (000s)')

nrow(h1b_data[h1b_data$CASE_STATUS == 'CERTIFIED',])
nrow(h1b_data[h1b_data$CASE_STATUS == 'CERTIFIED-WITHDRAWN',])
nrow(h1b_data[h1b_data$CASE_STATUS == 'WITHDRAWN',])
nrow(h1b_data[h1b_data$CASE_STATUS == 'DENIED',])

372,086 applications were certified; 11,930 were certified and then withdrawn; 12,339 were withdrawn and 5,380 were denied.
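
The four nrow() calls can also be collapsed into a single table() call. The sketch below uses a handful of invented statuses to show the idea, then checks that the counts reported above add up to the 401,735 observations in the data.

```r
# Toy case statuses (invented for illustration)
case_status <- c('CERTIFIED', 'CERTIFIED', 'DENIED', 'WITHDRAWN',
                 'CERTIFIED', 'CERTIFIED-WITHDRAWN')

# One call gives the count for every status
status_counts <- table(case_status)
status_counts

# Counts as reported in this article
reported <- c(CERTIFIED = 372086, `CERTIFIED-WITHDRAWN` = 11930,
              WITHDRAWN = 12339, DENIED = 5380)
sum(reported)
```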

Wage Levels

# after_stat(count) replaces the older ..count.. notation (ggplot2 3.3.0 and later)
ggplot(h1b_data, mapping = aes(x = PW_WAGE_LEVEL, fill = PW_WAGE_LEVEL)) +
  geom_bar(aes(y = after_stat(count) / 1000)) +
  labs(title = 'Wage Level Proportions', x = 'Wage Level', y =  'Count (000s)')

wage_grouped_data <- h1b_data %>%
                      group_by(PW_WAGE_LEVEL) %>%
                      dplyr::summarize(count = n())
View(wage_grouped_data)

74,610 applications were for a Level I wage; 212,480 were for a Level II wage; 59,809 were for a Level III wage and 31,886 were for a Level IV wage.

Although 22,950 applications did not specify the wage level, the majority of requests are for a Level II position. Level II means that the employee has obtained the necessary skills for the job, either through education or experience. Level I employees are at entry level, and levels higher than II require greater competency and professional expertise. N/A on the chart marks the applications that did not specify a wage level. With the new changes coming to the H-1B regulations, Level I would likely be removed, which would mean that requests for entry-level personnel are no longer accepted.
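
To put the wage-level counts in proportion, the shares can be worked out directly from the figures reported above. This is a small arithmetic sketch; the labels on the vector are chosen here for readability and may not match the exact values stored in PW_WAGE_LEVEL.

```r
# Counts as reported in this article (vector labels are illustrative)
wage_counts <- c(`Level I` = 74610, `Level II` = 212480,
                 `Level III` = 59809, `Level IV` = 31886, `N/A` = 22950)

# Share of each wage level among all applications, in percent
wage_share <- round(100 * wage_counts / sum(wage_counts), 1)
wage_share
```

Level II comes out at a little over half of all applications, which is what makes it the majority.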

Findings

Most of the summaries below use the same 3 lines of code, changing only the variable of interest each time.

Jobs

job_grouped_data <- h1b_data %>%
  group_by(JOB_CATEGORY) %>%
  dplyr::summarize(count = n(), percent = round(100 * count / nrow(h1b_data), 3))
job_grouped_data <-  job_grouped_data[order(-job_grouped_data$count), ]
View(job_grouped_data)
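
Since the same three steps (count, percent, sort) are repeated for each variable, they could be wrapped in a small helper. The sketch below is one possible way to do that in base R; count_by is a hypothetical name and the data is invented. Note that table() drops missing values by default, whereas group_by keeps them as their own group.

```r
# Hypothetical helper: count, percentage and descending sort for one column
count_by <- function(data, column) {
  counts <- table(data[[column]])
  summary_table <- data.frame(
    value = names(counts),
    count = as.vector(counts),
    stringsAsFactors = FALSE
  )
  summary_table$percent <- round(100 * summary_table$count / nrow(data), 3)
  summary_table[order(-summary_table$count), ]
}

# Toy data (invented for illustration)
toy <- data.frame(
  EMPLOYER_STATE = c('CA', 'CA', 'NJ', 'TX', 'CA', 'NJ'),
  JOB_CATEGORY = c('COMPUTING, STATISTICIANS', 'COMPUTING, STATISTICIANS', 'LEGAL',
                   'COMPUTING, STATISTICIANS', 'MEDICALS', 'COMPUTING, STATISTICIANS'),
  stringsAsFactors = FALSE
)

state_summary <- count_by(toy, 'EMPLOYER_STATE')
job_summary <- count_by(toy, 'JOB_CATEGORY')
state_summary
job_summary
```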

From the summary, it is seen that the “COMPUTING, STATISTICIANS” category has the highest percentage of requests for an LCA, being almost 70% of the total requests.

Tech roles are hot in the US…Apparently!!!

This is followed by jobs in the “FINANCIALS, COMPLIANCE” category, which accounts for almost 10% of requests and the “ENGINEERING EXCEPT COMPUTERS”, which is about 8%. That’s a wide gap; a really wide one!

This shows that America’s demand for external labour is more in the tech industry than in any other field. This is not far-fetched.

Related reading: “The tech talent gap is real. Increased diversity is the solution.” (mashable.com)

States

state_grouped_data <- h1b_data %>%
  group_by(EMPLOYER_STATE) %>%
  dplyr::summarize(count = n(), percent = round(100 * count / nrow(h1b_data), 3))
state_grouped_data <-  state_grouped_data[order(-state_grouped_data$count), ]
View(state_grouped_data)

And the winners are…

California leads the states requesting authorizations, followed by New Jersey, Texas, New York and Illinois. This suggests that people looking to get an authorization may do well to target companies in these states, as employers there file more requests for international labour than those in other states. For the full name of each state and territory, please refer here.

Cities

city_grouped_data <- h1b_data %>%
  group_by(EMPLOYER_CITY, EMPLOYER_STATE) %>%
  dplyr::summarize(count = n(), percent = round(100 * count / nrow(h1b_data), 3))
city_grouped_data <-  city_grouped_data[order(-city_grouped_data$count), ]
View(city_grouped_data)

And the winners are…

There are about 5,183 cities recorded in the data, but this is a view of the 20 cities making the highest number of requests.

New York City in New York, Chicago in Illinois, Philadelphia in Pennsylvania, Rockville in Maryland and Plano in Texas are the 5 cities with the highest number of requests.
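
The city summary groups by both EMPLOYER_CITY and EMPLOYER_STATE because the same city name can occur in more than one state. A short sketch with invented rows shows how counting by city name alone would merge them:

```r
# Toy employer locations (invented for illustration)
employer <- data.frame(
  EMPLOYER_CITY  = c('SPRINGFIELD', 'SPRINGFIELD', 'SPRINGFIELD', 'CHICAGO'),
  EMPLOYER_STATE = c('IL', 'MA', 'IL', 'IL'),
  stringsAsFactors = FALSE
)

# Counting by city name alone lumps both Springfields together
by_city <- table(employer$EMPLOYER_CITY)

# Counting by the city and state pair keeps them apart
by_city_state <- table(paste(employer$EMPLOYER_CITY, employer$EMPLOYER_STATE, sep = ', '))

by_city
by_city_state
```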

Nationalities

It would also be good to have an idea of the nationalities of the employees for whom the LCA requests are made. This is not possible with the currently available H-1B data, as the nationalities of the employees are not stated. However, we can get an idea from the PERM data, which covers permanent labour certification, a step towards the employment-based permanent residency popularly known as the “Green Card”.

Although it is not exactly the H-1B data, it can give an idea of the nationalities, as many people hold the non-immigrant visa before getting the permanent work authorization. We therefore assume that many of those who applied for a permanent work authorization have at one time held the H-1B.

The description of the variables in the data set can be viewed here.

Looking at the data, the COUNTRY_OF_CITIZENSHIP and the FW_INFO_BIRTH_COUNTRY are variables that tell the nationality of the employee. Using either one yields almost the same results.
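
One way to check how closely the two columns agree is to compute the share of rows where they hold the same value. The sketch below uses invented rows, so the figure it prints says nothing about the real PERM data.

```r
# Toy rows (invented for illustration), using the two column names from the PERM data
perm_sample <- data.frame(
  COUNTRY_OF_CITIZENSHIP = c('INDIA', 'CHINA', 'CANADA', 'INDIA', 'MEXICO'),
  FW_INFO_BIRTH_COUNTRY  = c('INDIA', 'CHINA', 'INDIA', 'INDIA', 'MEXICO'),
  stringsAsFactors = FALSE
)

# TRUE where citizenship and birth country match
same_country <- perm_sample$COUNTRY_OF_CITIZENSHIP == perm_sample$FW_INFO_BIRTH_COUNTRY

# Share of rows where both columns agree, ignoring rows with missing values
agreement <- mean(same_country, na.rm = TRUE)
agreement
```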

library(plyr)       # loaded before dplyr so that dplyr's verbs are not masked
library(dplyr)
library(tidyverse)
library(caret)
library(lubridate)

# read in data from xlsx sheet
perm_data <- readxl::read_xlsx('PERM_FY2018.xlsx')

# select just the few data variables of interest
perm_data <- select(perm_data,
                   'COUNTRY_OF_CITIZENSHIP', 'FW_INFO_BIRTH_COUNTRY')

# country of citizenship
emp_birth_country_grouped_data <- perm_data %>%
  group_by(COUNTRY_OF_CITIZENSHIP) %>%
  dplyr::summarize(count = n(), percent = 100 * count / nrow(perm_data))
emp_birth_country_grouped_data <-  emp_birth_country_grouped_data[order(-emp_birth_country_grouped_data$count), ]
View(emp_birth_country_grouped_data)

And the winners are…

So, the top nations providing external labour to the American economy are India, China, South Korea, Canada and Mexico, with India topping the list, having almost 50% of requests.

Canadians and Mexicans generally do not have to hold an H-1B because of the agreement between the US and their countries. Instead, they hold a TN visa, a non-immigrant visa created under the North American Free Trade Agreement, and then get a permanent authorization afterwards.

Other variables of interest would be the top employers and attorney firms. It would be good to know which legal firms process the H-1Bs the most, which likely means they are very good at settling such matters.

Deloitte Consulting LLP, Tata Consultancy Services Limited, Infosys Limited, IBM and Google seem to be among the top recruiting companies.

Do note that the data might be updated with time and may change, causing the results one may get at a later time to be slightly different from what has been obtained in this analysis.

Please feel free to connect with me on Linkedin; let me know of opportunities that you think might be of interest to me and also leave your comments below.

With the analysis done so far, I guess it’s time to present our findings to the Congress!!!

Written by Olawunmi George