• November 21, 2024

Web Scraping Linkedin

Use Selenium & Python to scrape LinkedIn profiles


It was last year that the legal battle HiQ Labs v LinkedIn first made headlines, in which LinkedIn attempted to block the data analytics company from using its data for commercial benefit.
HiQ Labs used software to extract LinkedIn data in order to build algorithms for products capable of predicting employee behaviours, such as when an employee might quit their job.
This technique, known as Web Scraping, is the automated process in which the HTML of a web page is used to extract data.
How hard can it be?
LinkedIn has since made its site more restrictive to web scraping tools. With this in mind, I decided to attempt extracting data from LinkedIn profiles just to see how difficult it would be, especially as I am still in my infancy of learning Python.
Tools Required
For this task I will be using Selenium, which is a tool for writing automated tests for web applications. The number of web pages you can scrape on LinkedIn is limited, which is why I will only be scraping key data points from 10 different user profiles.
Prerequisite Downloads & Installs
Download ChromeDriver, which is a separate executable that WebDriver uses to control Chrome. Also you will need to have a Google Chrome browser application for this to work.
Open your Terminal and enter the following install commands needed for this task. The time and csv modules used later are part of Python's standard library, so they do not need to be installed.
pip3 install ipython
pip3 install selenium
pip3 install parsel
Automate LinkedIn Login
In order to guarantee access to user profiles, we will need to log in to a LinkedIn account, so we will also automate this process.
Open a new terminal window and type “ipython”, which is an interactive shell built with Python. It offers different features including proper indentation and syntax highlighting.
We will be using the ipython terminal to execute and test each command as we go, instead of having to execute a file. Within your ipython terminal, execute each line of code listed below, excluding the comments. We will create a variable “driver” which is an instance of Google Chrome, required to perform our commands.
from selenium import webdriver
driver = webdriver.Chrome('/Users/username/bin/chromedriver')
driver.get('https://www.linkedin.com')
The driver.get() method will navigate to the LinkedIn website and the WebDriver will wait until the page has fully loaded before another command can be executed. If you have installed everything listed and executed the above lines correctly, the Google Chrome application will open and navigate to the LinkedIn website.
The above notification banner should be displayed informing you that WebDriver is controlling the browser.
To populate the text forms on the LinkedIn homepage with an email address and password, Right Click on the webpage, click Inspect and the Dev Tools window will appear.
Clicking on the circled Inspect Elements icon, you can hover over any element on the webpage and the HTML markup will appear highlighted as seen above. The class and id attributes have the value “login-email”, so we can choose either one to use.
WebDriver offers a number of ways to find an element starting with “find_element_by_” and by using tab we can display all methods available.
The below lines will find the email element on the page and the send_keys() method contains the email address to be entered, simulating key strokes.
username = driver.find_element_by_class_name('login-email')
username.send_keys('')
Finding the password element is the same process as the email element, with the values for its class and id being “login-password”.
password = driver.find_element_by_class_name('login-password')
password.send_keys('xxxxxx')
Additionally we have to locate the submit button in order to successfully log in. Below are 3 different ways in which we can find this element, but we only require one. The click() method will mimic a button click which submits our login request.
log_in_button = driver.find_element_by_class_name('login-submit')
log_in_button = driver.find_element_by_id('login submit-button')
log_in_button = driver.find_element_by_xpath('//*[@type="submit"]')
log_in_button.click()
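The find_element_by_* helpers used above were removed from the Python bindings in Selenium 4.3, so on a current install the same login steps might look like the sketch below. It assumes the class names and XPath quoted above still match LinkedIn's login form, which may no longer be the case, and the log_in function and its parameters are illustrative names, not part of Selenium. The locator strings are the values of Selenium's By.CLASS_NAME and By.XPATH constants; with Selenium imported you would normally write By.CLASS_NAME instead.

```python
# Locator strategy strings; these equal selenium's By.CLASS_NAME and By.XPATH.
CLASS_NAME = "class name"
XPATH = "xpath"


def log_in(driver, email, password):
    """Fill in the login form and submit it (hypothetical helper)."""
    # The element names come from the article and are assumptions about the page.
    driver.find_element(CLASS_NAME, "login-email").send_keys(email)
    driver.find_element(CLASS_NAME, "login-password").send_keys(password)
    driver.find_element(XPATH, '//*[@type="submit"]').click()
```

With a real browser session this would be called as log_in(driver, email, password) right after driver.get(); because the function only needs an object with a find_element method, it can also be exercised against a stub.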
Once all command lines from the ipython terminal have been successfully tested, copy each line into a new python file (Desktop/). Within a new terminal (not ipython), navigate to the directory that contains the file and execute it using a command similar to the one below.
cd Desktop
python
That was easy!
If your LinkedIn credentials were correct, a new Google Chrome window should have appeared, navigated to the LinkedIn webpage and logged into your account.
Code so far…
""" filename: """
writer = csv.writer(open(parameters.file_name, 'wb'))
writer.writerow(['Name', 'Job Title', 'Company', 'College', 'Location', 'URL'])
username.send_keys(parameters.linkedin_username)
sleep(0.5)
password.send_keys(parameters.linkedin_password)
sign_in_button = driver.find_element_by_xpath('//*[@type="submit"]')
Searching LinkedIn profiles on Google
After successfully logging into your LinkedIn account, we will navigate back to Google to perform a specific search query. Similarly to what we have previously done, we will select an attribute for the main search form on Google.
We will use the name attribute, which has the value “q”, to locate the search form, and continuing on from our previous code we will add the following lines below.
search_query = driver.find_element_by_name('q')
search_query.send_keys(' AND "python developer" AND "London"')
search_query.send_keys(Keys.RETURN)
The search query ‘ AND "python developer" AND "London"’ will return 10 LinkedIn profiles per page.
Next we will be extracting the green URLs of each LinkedIn user’s profile. After inspecting the elements on the page, these URLs are contained within a “cite” element. However, after testing within ipython to return the list length and contents, I saw that some advertisements were being extracted, which also include a URL within a “cite” element.
Using Inspect Element on the webpage, I checked to see if there was any unique identifier separating LinkedIn URLs from the advertisement URLs.

As you can see above, the class value “iUh30” for LinkedIn URLs is different to that of the advertisement values of “UdQCqe”. To avoid extracting unwanted advertisements, we will only specify the “iUh30” class to ensure we only extract LinkedIn profile URLs.
linkedin_urls = driver.find_elements_by_class_name('iUh30')
We then reassign the “linkedin_urls” variable to a list comprehension, which loops over each element in the list and extracts its text.
linkedin_urls = [url.text for url in linkedin_urls]
Once you have assigned the variable “linkedin_urls” you can use it to return the full list contents or to return specific elements within our list, as seen below.
linkedin_urls
linkedin_urls[0]
linkedin_urls[1]
In the ipython terminal below, all 10 account URLs are contained within the list.
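Class names such as “iUh30” are generated by Google and can change without notice, so a second check on the extracted text itself may be worth adding. The sketch below keeps only entries that look like profile links and drops repeats; the keep_profile_urls name and the "linkedin.com/in/" marker are assumptions made for illustration.

```python
def keep_profile_urls(texts, marker="linkedin.com/in/"):
    """Return the entries that look like LinkedIn profile URLs, without duplicates."""
    seen = set()
    profiles = []
    for text in texts:
        url = text.strip()
        # Skip advertisements and anything already collected.
        if marker in url and url not in seen:
            seen.add(url)
            profiles.append(url)
    return profiles
```

In the script this would be applied to the list built above, for example linkedin_urls = keep_profile_urls(linkedin_urls).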
Next we will create a new Python file called “parameters.py” to contain variables such as search query, file name, email and password, which will simplify our main script file.
search_query = ' AND "python developer" AND "London"'
file_name = ''
linkedin_username = ''
linkedin_password = 'xxxxxx'
As we are storing these variables within a separate file called “parameters.py”, we need to import the file in order to reference these variables from within the main script. Ensure both files are in the same folder or directory.
import parameters
As we will be reading all the variables defined in “parameters.py” through the imported parameters module above, we need to make changes within our main script to reference these values.
search_query.send_keys(parameters.search_query)
password.send_keys(parameters.linkedin_password)
As we previously imported the sleep function from the time module, we will use this to add pauses between different actions to allow the commands to be fully executed without interruption.
sleep(2)
from time import sleep
from selenium.webdriver.common.keys import Keys
driver.get('https://www.google.com')
sleep(3)
search_query.send_keys(parameters.search_query)
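Fixed pauses are simple, but they either waste time or turn out too short on a slow connection. One alternative is to poll for a condition until it holds; Selenium ships its own WebDriverWait class for this, and the standard-library sketch below only illustrates the idea. The wait_until name and its default timeout and interval are assumptions.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        # Pause briefly before checking again.
        time.sleep(interval)
```

The condition can be any callable, for instance a lambda that looks up the search box and returns it once it is present on the page.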
The fun part, scraping data
To scrape data points from a web page we will need to make use of Parsel, which is a library for extracting data points from websites. As we have already installed this at the start, we also need to import this module within our main script.
from parsel import Selector
After importing parsel within your ipython terminal, enter “driver.page_source” to load the full source code of the Google search webpage, which looks like something from the Matrix.
As we will want to extract data from a LinkedIn account, we need to navigate to one of the profile URLs returned from our search within the ipython terminal, not via the browser.
driver.page_source
We will create a For Loop to incorporate these commands into our main script and iterate over each URL in the list. On each pass, the “linkedin_url” variable holds the current LinkedIn profile URL and the driver.get() method navigates to it.
for linkedin_url in linkedin_urls:
    driver.get(linkedin_url)
    sleep(5)
    sel = Selector(text=driver.page_source)
Lastly we have defined a “sel” variable, assigning it the full source code of the LinkedIn user’s profile page.
Finding Key Data Points
Using the below LinkedIn profile as an example, you can see that multiple key data points have been highlighted, which we can extract.
Like we have done previously, we will use the Inspect Element on the webpage to locate the HTML markup we need in order to correctly extract each data point. Below are two possible ways to extract the full name of the user.
name = sel.xpath('//h1/text()').extract_first()
name = sel.xpath('//*[starts-with(@class, "pv-top-card-section__name")]/text()').extract_first()
When running the commands in the ipython terminal I noticed that the text isn’t always formatted correctly, as seen below with the Job Title.
However, by using an IF statement for job_title we can use the strip() method, which will remove the newline symbol and white spaces.
if job_title:
    job_title = job_title.strip()
Continue to locate each attribute and its value for each data point you want to extract. I recommend using the class name to locate each data point instead of heading tags, e.g. h1, h2. By adding further IF statements for each data point we can handle any text that may not be formatted correctly.
An example below of extracting all 5 data points previously highlighted.
name = sel.xpath('//*[starts-with(@class, "pv-top-card-section__name")]/text()').extract_first()
if name:
    name = name.strip()
job_title = sel.xpath('//*[starts-with(@class, "pv-top-card-section__headline")]/text()').extract_first()
if job_title:
    job_title = job_title.strip()
company = sel.xpath('//*[starts-with(@class, "pv-top-card-v2-section__entity-name pv-top-card-v2-section__company-name")]/text()').extract_first()
if company:
    company = company.strip()
college = sel.xpath('//*[starts-with(@class, "pv-top-card-v2-section__entity-name pv-top-card-v2-section__school-name")]/text()').extract_first()
if college:
    college = college.strip()
location = sel.xpath('//*[starts-with(@class, "pv-top-card-section__location")]/text()').extract_first()
if location:
    location = location.strip()
linkedin_url = driver.current_url
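The extract-then-strip pattern repeats for every data point, so it could be folded into one small helper. The sketch below is only one way to do it: first_text is a made-up name, and it relies on nothing more than the .xpath(...).extract_first() calls already used above.

```python
def first_text(sel, xpath, default=None):
    """Return the first XPath match with surrounding whitespace removed."""
    value = sel.xpath(xpath).extract_first()
    if value is None:
        return default
    value = value.strip()
    # Treat whitespace-only text the same as a missing value.
    return value or default
```

Each data point then becomes a single line, for example name = first_text(sel, '//h1/text()').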
Printing to console window
After extracting each data point we will output the results to the terminal window using the print() statement, adding a newline before and after each profile to make it easier to read.
print('\n')
print('Name: ' + name)
print('Job Title: ' + job_title)
print('Company: ' + company)
print('College: ' + college)
print('Location: ' + location)
print('URL: ' + linkedin_url)
At the beginning of our code, below our imports section, we will define a new variable “writer”, which will create the csv file and insert the column headers listed below.
The previously defined “file_name” is read from the parameters file, and the second parameter ‘wb’ opens the file for writing in binary mode, which is what the csv module expects under Python 2; under Python 3 the file has to be opened in text mode instead (‘w’ together with newline=''). The writerow() method is used to write each column heading to the csv file, matching the order in which we will print them to the terminal console.
Printing to CSV
As we have printed the output to the console, we need to also print the output to the csv file we have created. Again we are using the writerow() method to pass in each variable to be written to the csv file.
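writer.writerow([name.encode('utf-8'),
                 job_title.encode('utf-8'),
                 company.encode('utf-8'),
                 college.encode('utf-8'),
                 location.encode('utf-8'),
                 linkedin_url.encode('utf-8')])
We are encoding with utf-8 to ensure all characters extracted from each profile get loaded correctly. This goes together with the ‘wb’ file mode; under Python 3, where the csv module writes text, the encode() calls would be dropped and the encoding passed to open() instead.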
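For readers running Python 3, the csv step could be written as in the sketch below. The write_profiles name and its arguments are illustrative; the column headers are the ones used in this article.

```python
import csv


def write_profiles(path, rows):
    """Write the header and one row per profile to a UTF-8 csv file."""
    header = ['Name', 'Job Title', 'Company', 'College', 'Location', 'URL']
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
```

Collecting the rows in a list and writing them in one go also means the file is closed properly when the with block ends.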
Fixing things
If we were to execute our current code within a new terminal we will encounter an error similar to the one below. It is failing to concatenate a string to display the college value as there is no college displayed on this profile and so it contains no value.
To account for profiles with missing data points, we can write a function “validate_field” which takes “field” as a parameter. Ensure this function is placed at the start of the application, just under the imports section.
def validate_field(field):
    # substitute a placeholder when the data point is missing
    if not field:
        field = 'No results'
    return field
In order for this function to actually work, we have to add the below lines to our code, which validate whether each field exists. If the field doesn’t exist, the text “No results” will be assigned to the variable. Add these lines before printing the values to the console window.
name = validate_field(name)
job_title = validate_field(job_title)
company = validate_field(company)
college = validate_field(college)
location = validate_field(location)
linkedin_url = validate_field(linkedin_url)
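If the data points are kept in a dictionary instead of six separate variables, the same check can be applied in one pass. This is a sketch of that alternative, not the approach taken in this article; validate_profile is a made-up name, and validate_field is repeated here only so the snippet runs on its own.

```python
def validate_field(field):
    """Return the field, or 'No results' when it is missing or empty."""
    return field if field else 'No results'


def validate_profile(profile):
    """Apply validate_field to every value of a profile dictionary."""
    return {key: validate_field(value) for key, value in profile.items()}
```

A profile such as {'name': 'Jane', 'college': None} would come back with 'No results' in place of the missing college.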
Let’s run our code..
Finally we can run our code from the terminal, with the output printing to the console window and a new csv file being created under the name set in “file_name”.
Things you could add..
You could easily amend my code to automate lots of cool things on any website to make your life much easier. For the purposes of demonstrating extra functionality and learning purposes within this application, I have overlooked aspects of this code which could be enhanced for better efficiency such as error handling.
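As one example of the error handling mentioned above, the loop over profiles could catch failures per URL so that a single bad page does not end the whole run. The sketch below assumes the per-profile work has been moved into a function passed in as scrape_one; both names are hypothetical.

```python
def scrape_all(urls, scrape_one):
    """Run scrape_one(url) for each URL, collecting results and failures separately."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(scrape_one(url))
        except Exception as error:  # broad on purpose: keep going after one bad profile
            failures.append((url, repr(error)))
    return results, failures
```

The failures list can then be printed or written to a file at the end, which makes it easier to retry just the profiles that went wrong.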
Final code…
It was a long process to follow but I hope you found it interesting. In the end LinkedIn, like most other sites, is pretty straightforward to scrape data from, especially using the Selenium tool. The full code can be requested by directly contacting me via LinkedIn.
Questions to be answered…
Are LinkedIn right in trying to prevent third party companies from extracting our publicly shared data for commercial purposes, such as HR departments or recruitment agencies?
Is LinkedIn trying to protect our data or hoard it for themselves, holding a monopoly on our lucrative data?
Personally, I think that any software which can be used to help recruiters or companies match skilled candidates to better suited jobs is a good thing.
Scraping LinkedIn in 2021: Is it Legal? - Medium


Web scraping is essentially extracting data from certain platforms for further processing and transformation into useful outputs. While data scraping may be a sensitive topic in terms of data privacy and its legality, I will provide a breakdown as well as conclusions of a prominent LinkedIn scraping lawsuit.
In May 2017, LinkedIn sent hiQ, a web scraping company, a cease-and-desist letter in which it asserted that hiQ was in violation of LinkedIn’s User Agreement. The letter demanded that hiQ stop accessing and copying data from LinkedIn’s servers, stating that any future access by hiQ would violate state and federal law, including the Computer Fraud and Abuse Act (“CFAA”) and the Digital Millennium Copyright Act (“DMCA”).
In response, hiQ demanded that LinkedIn recognise hiQ’s right to access public pages on LinkedIn and sought a declaratory judgment, a conclusive decision by the court, that LinkedIn could not invoke, among other laws, the CFAA and DMCA against it. hiQ also requested a preliminary injunction against LinkedIn, seeking to prevent LinkedIn from acting on its cease-and-desist letter. The district court granted the preliminary injunction, ordering LinkedIn to withdraw the letter, remove technical barriers to hiQ’s access to public profiles, and refrain from implementing legal or technical measures to block hiQ’s access to public profiles until a ruling has been reached.
LinkedIn appealed this decision to the US 9th Circuit Court of Appeals. In 2019, the 9th Circuit affirmed the district court’s preliminary injunction. Following that decision, LinkedIn further appealed to the US Supreme Court (SCOTUS), but it is unclear whether the court has agreed to hear the appeal. Until a judgment is released by SCOTUS, however, the decision by the 9th Circuit remains good law.
While some observers have hailed the 9th Circuit decision as being a golden ticket permitting all types of web scraping, the issue is far more nuanced than that. In fact, the scope of the issue is extremely narrow, turning on the definition of “without authorization”. In hiQ’s own words, the question before SCOTUS is:
QUESTION PRESENTED: Whether a professional networking website may rely on the Computer Fraud and Abuse Act’s prohibition on “intentionally access[ing] a computer without authorization” to prevent a competitor from accessing information that the website’s users have shared on their public profiles and that is available for viewing by anyone with a web browser. (Counsel for hiQ in its brief to SCOTUS)
The narrowness of the issue presented to SCOTUS by LinkedIn means that the court only has to decide on this one matter, and will not have to consider other potential issues arising from web scraping such as data privacy concerns, breach of contractual terms, or even violations of other state and federal laws. Optimistically, it can be inferred that because LinkedIn decided to pursue the case on this ground instead of through other causes of action, those other causes are less likely to be issues for web scrapers. But the reality is that, because web scraping is a relatively new phenomenon, the law surrounding it remains underdeveloped and there is little legal clarity in the area.
While there is still a grey area regarding the legality of web scraping, we can say for certain that web scraping in itself remains legal. This is big news for individuals and companies alike. With the large amount of data presented online, there is an enormous amount of information available from which it is difficult to obtain any useful insights on its own. Thankfully, there are many web scrapers available that are able to tidy up the necessary data and eliminate any noise. Scraping popular platforms such as Reddit, Twitter, Facebook and especially LinkedIn can be extremely beneficial to companies.
For individuals
The average Joe interested in exploring web scraping can probably get by with free web scraping APIs that can obtain small amounts of data. Some side projects to consider if you are interested in picking up web scraping would be scraping food review websites such as Yelp or Burpple to find the best fried chicken in your country, or scraping social media platforms such as Reddit and Twitter and conducting the necessary analysis to decide your next investment in the stock market.
For large-scale projects that require data on millions of individuals, it is definitely not feasible to rely on these free but slow web scraping APIs and wait weeks, if not months, for the data to be collected (if your computer does not overheat and crash by then).
For companies
Apart from food review sites and social media platforms, LinkedIn seems to be the most relevant platform to scrape from for B2B companies. Depending on the magnitude of data you require, there are many paid LinkedIn scraping services that satisfy different needs. A comprehensive list of the top 5 varying LinkedIn scraping services can be found here. This provides a better understanding of what these different companies offer and helps you find the service best suited to your company’s needs.
With the data provided by the scraping services, businesses are able to use it for many functions:
Updating its current database: Enrich current database with up-to-date data
Leads Generation for B2B sales: LinkedIn URL/email discovery
Research: Use company data to predict market and industry trends
Human Resource: Improves hiring for ATS and recruitment platforms
Investment (Venture Capitalists): Chart out company performances and decide which companies are performing well
Alumni (Universities): Find out distribution of their alumni based on location, industry or companies with further transformation of data
Here at Mantheos we conduct LinkedIn scraping legally, scraping data that is freely and publicly available on LinkedIn. This means that we collect data that is accessible to the general public. Compared to manually searching LinkedIn for people and company profiles, we automate this process for you and aggregate this information into readable files such as excel and json. By engaging our services, you can rest assured that we will provide data that is both safe as well as useful to your business.
As the legality of web scraping becomes clearer, we can safely say that many forms of web scraping are not deemed illegal by the courts and are permissible. Web scraping is an integral part of the big data revolution and is empowering millions of businesses around the world to optimise their business strategies. With web scraping becoming ever more ubiquitous, the myriad of privacy and contractual issues surrounding web scraping is growing more complex. This forms a potential stumbling block for both web scraping companies and end users.
As data laws become more rigid and penalties for violations increase, it is now more important than ever before to ensure that your business is not exposed to unnecessary legal risk by unknowingly flouting data laws. Mantheos prides itself on ensuring that its business practices are fully compliant with all laws and regulations, regardless of jurisdiction.
References:
hiQ Labs, Inc. v. LinkedIn Corp., No. 17-16783 (9th Cir. 2019)
LinkedIn’s appeal to the US Court of Appeals
How to Scrape LinkedIn - Dev Genius


It’s become rather difficult to scrape some of the larger tech websites, such as LinkedIn, likely due to the amount of personal information at stake. But I’m here to mess up everything they worked so hard to prevent. Also, for the record, this is for educational purposes only.
Libraries
Soup Objects
CSRF Tokens
Login Request and HTML
Potentially add more pages
Scraping Metadata for Each Column
Create and Append Columns to DataFrame
This part is a little tricky. First import the libraries, then create the request session. Then I just added some link variables to different linkedin pages, then I created the soup variables, and this is where it’s a little bit fuzzy. You’ll have to use the csrf line to get past the token that linkedin requires for each request. Then you’ll pass that into the login request. Keep in mind, this uses your linkedin! I don’t know the parameters or rules for what will ultimately get you blocked, so use at your own risk. I never had any issues. If you get stuck, just refresh the notebook (jupyter gets hung up some times). This project is using a url with the query “aspiring data scientist”; dummy-account linkedin creds can be used here.
import requests
from bs4 import BeautifulSoup
#create a session
client = requests.Session()
#create an email and password variable
your_email = "some_email"
your_password = "your_password"
#create url page variables
HOMEPAGE_URL = ''
LOGIN_URL = ''
CONNECTIONS_URL = ''
ASPIRING_DATA_SCIENTIEST = ''
#get url, soup object and csrf token value
html = client.get(HOMEPAGE_URL).content
soup = BeautifulSoup(html, "html.parser")
csrf = soup.find('input', dict(name='loginCsrfParam'))['value']
#create login parameters
login_information = {
    'session_key': your_email,
    'session_password': your_password,
    'loginCsrfParam': csrf,
}
Since Linkedin (I think, and don’t quote me on this) uses a javascript injected way of storing the data, beautifulsoup was less helpful. Thus, I scraped the metadata. It’s way easier, in my opinion, to scrape data inside of a dataframe rather than raw text. Thus, I put everything in a list, put that list in a dataframe and then created a string column using map.
#create dataframe with soup contents
import pandas as pd
soup = list(soup)
df = pd.DataFrame()
df['test'] = soup
#Create string column of soup contents
df['liststring'] = [', '.join(map(str, l)) for l in df['test']]
df
Now that’s a good lookin’ DataFrame! The project was a work in progress the entire time. Scraping text to parse out bits and pieces of actual data is not glamorous. There may be a quicker way to do a lot of this in regex; I exited this step as soon as I found working code.
#parse out all the text except the interesting stuff
df1 = pd.DataFrame(df['liststring'].str.split('{').tolist()).stack().reset_index()
df1.columns = ['level_0', 'level_1', 'test']
df1 = df1[df1['test'].str.contains("textDirection")]
df1 = df1[df1['test'].str.contains('type":""}, "type":"PROFILE')]
df1
Getting Closer! This code snippet creates two columns in our dataframe: name and shared_connections with the account that logged in.
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)
df1 = df1.replace('"textDirection":"USER_LOCALE", "text":"', '', regex=True)
df1 = df1.replace(' shared connections", "snippetText":', '', regex=True)
df1 = df1.replace('type":"", "headless":false, "socialProofText":"', '', regex=True)
df1 = df1.replace('", "', '', regex=True)
df1 = df1.replace(' shared connectionsnippetText":', '', regex=True)
df1 = df1.replace('type":""}, "type":"PROFILE', '', regex=True)
df1 = df1.replace('$$1', '', regex=True)
df1 = df1.replace('$$2', '', regex=True)
df1 = df1.replace('$$3', '', regex=True)
df1 = df1.replace('$$4', '', regex=True)
df1 = df1.replace('$$5', '', regex=True)
df1 = df1.replace('$$6', '', regex=True)
df1 = df1.replace('$$7', '', regex=True)
df1 = df1.replace('$$8', '', regex=True)
df1 = df1.replace('$$9', '', regex=True)
#drop
df1.drop(['level_0', 'level_1'], axis = 1, inplace = True)
#Create shared connections column
df1['shared_connections'] = df1['test'].str[-1:]
#remove last 3 characters
df1['name'] = df1['test'].str[:-3]
#drop test col
df1 = df1.drop(['test'], axis = 1)
#rename columns
df1 = df1[['name', 'shared_connections']]
df1
Now You’re Talkin’. This bit of code creates a new dataframe with one column called test because i’m that original; this column is the title of the user. For instance, my title says “Data Engineering Consultant”.
df2 = pd.DataFrame(df['liststring'].str.split('{').tolist()).stack().reset_index()
df2.columns = ['level_0', 'level_1', 'test']
df2 = df2[df2['test'].str.contains('"textDirection":"USER_LOCALE", "text"')]
df2 = df2[df2['test'].str.contains('""}, "nameMatch":false, "subline":')]
df2 = df2.replace('"textDirection":"USER_LOCALE", "text":"', '', regex=True)
df2 = df2.replace('"nameMatch":false, "subline":', '', regex=True)
df2 = df2.replace('', '', regex=True)
df2['test'] = df2['test'].str[:-11]
df2 = df2.replace('", "', '', regex=True)
#drop
df2.drop(['level_0', 'level_1'], axis = 1, inplace = True)
df2
Tight bro. Let’s merge the last two dataframes real quick.
#add title column
title = list(df2['test'])
df1['title'] = title
df1
Not. Too. Shabby. Shhh.. Abby (if you get this, we should be friends). I really do have too much fun with these posts. Let’s add one more column called test (you heard me); this one will be for the location, if the user offered up that info on their profile that is.
#filter out location
df3 = pd.DataFrame(df['liststring'].str.split('{').tolist()).stack().reset_index()
df3.columns = ['level_0', 'level_1', 'test']
df3 = df3[df3['test'].str.contains('"textDirection":"USER_LOCALE", "text"')]
df3 = df3[df3['test'].str.contains('"}, "trackingId"')]
df3 = df3.replace('"textDirection":"USER_LOCALE", "text":"', '', regex=True)
df3 = df3.replace('"}', '', regex=True)
df3 = df3.replace('"trackingId"', '', regex=True)
df3 = df3.replace('$type":"', '', regex=True)
df3 = df3.replace('type":""}, ', '', regex=True)
df3 = df3.replace('type":", :"G6x05qiqQfO29accDNG79w=="}], "type":"SEARCH_HITS", "', '', regex=True)
df3 = df3.replace('$type":", :', '', regex=True)
df3['test'] = df3['test'].str[:-41]
df3
Austin Texas, I see you. Let’s merge the last and the original dataframes together and see what we get. We’re all very excited.
#create location column
location = list(df3['test'])
df1['location'] = location
df1
Done Son. We wrote a script that logs us in to linkedin, scrapes the first page of a specific search, parses out name, shared_connections, title and location and puts all this data into a pandas dataframe. If you found this helpful, feel free to subscribe or smash that clap button, yeah!
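The CSRF lookup in the login step above indexes straight into the result of soup.find(), which raises a TypeError whenever the hidden field is missing from the page. As a rough illustration of a more defensive lookup, here is a standard-library sketch that returns None instead. The field name loginCsrfParam is the one used in this post, and whether LinkedIn still serves it is an assumption; HiddenInputFinder and find_csrf_token are made-up names.

```python
from html.parser import HTMLParser


class HiddenInputFinder(HTMLParser):
    """Collect the value attribute of the first <input> with a given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.value = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Only the first matching input is kept.
        if tag == 'input' and attributes.get('name') == self.name and self.value is None:
            self.value = attributes.get('value')


def find_csrf_token(html, name='loginCsrfParam'):
    """Return the token value, or None when the field is absent."""
    finder = HiddenInputFinder(name)
    finder.feed(html)
    return finder.value
```

The function takes the page as a string, so the bytes returned by a request would need decoding first; a None result is a signal to stop before sending the login request.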

Frequently Asked Questions about web scraping linkedin

Does LinkedIn allow web scraping?

Here at Mantheos we conduct LinkedIn scraping legally, scraping data that is freely and publicly available on LinkedIn. This means that we collect data that is accessible to the general public. (Jun 15, 2021)

Is LinkedIn hard to scrape?

It’s become rather difficult to scrape some of the larger tech websites, such as LinkedIn, likely due to the amount of personal information at stake. (Jun 30, 2020)

How do I web scrape my LinkedIn profile?

In this article, I explained that scraping Linkedin profiles is a two-step process. The first step is to crawl Linkedin profiles and save the HTML code for further processing in the second step. The second step is to process the HTML code and turn raw HTML code into structured data that you can use in your application. (Sep 8, 2020)
