Python Parse Text
Parsing text with Python – vipinajayakumar
I hate parsing files, but it is something that I have had to do at the start of nearly every project. Parsing is not easy, and it can be a stumbling block for beginners. However, once you become comfortable with parsing files, you never have to worry about that part of the problem. That is why I recommend that beginners get comfortable with parsing files early on in their programming education. This article is aimed at Python beginners who are interested in learning to parse text files.
In this article, I will introduce you to my system for parsing files. I will briefly touch on parsing files in standard formats, but what I want to focus on is the parsing of complex text files. What do I mean by complex? Well, we will get to that, young padawan.
For reference, the slide deck that I use to present on this topic is available here. All of the code and the sample text that I use is available in my Github repo here.
Why parse files?
The big picture
Parsing text in standard format
Parsing text using string methods
Parsing text in complex format using regular expressions
Step 1: Understand the input format
Step 2: Import the required packages
Step 3: Define regular expressions
Step 4: Write a line parser
Step 5: Write a file parser
Step 6: Test the parser
Is this the best solution?
Conclusion
First, let us understand what the problem is. Why do we even need to parse files? In an imaginary world where all data existed in the same format, one could expect all programs to input and output that data. There would be no need to parse files. However, we live in a world where there is a wide variety of data formats. Some data formats are better suited to different applications. An individual program can only be expected to cater for a selection of these data formats. So, inevitably there is a need to convert data from one format to another for consumption by different programs. Sometimes data is not even in a standard format which makes things a little harder.
So, what is parsing?
Parse
Analyse (a string or text) into logical syntactic components.
I don’t like the above Oxford dictionary definition. So, here is my alternate definition.
Convert data in a certain format into a more usable format.
With that definition in mind, we can imagine that our input may be in any format. So, the first step, when faced with any parsing problem, is to understand the input data format. If you are lucky, there will be documentation that describes the data format. If not, you may have to decipher the data format for yourselves. That is always fun.
Once you understand the input data, the next step is to determine what would be a more usable format. Well, this depends entirely on how you plan on using the data. If the program that you want to feed the data into expects a CSV format, then that’s your end product. For further data analysis, I highly recommend reading the data into a pandas DataFrame.
If you are a Python data analyst then you are most likely familiar with pandas. It is a Python package that provides the DataFrame class and other functions to do insanely powerful data analysis with minimal effort. It is an abstraction on top of NumPy, which provides multi-dimensional arrays, similar to MATLAB. The DataFrame is a 2D array, but it can have multiple row and column indices, which pandas calls MultiIndex, that essentially allow it to store multi-dimensional data. SQL or database style operations can be easily performed with pandas (Comparison with SQL). Pandas also comes with a suite of IO tools which includes functions to deal with CSV, MS Excel, JSON, HDF5 and other data formats.
Although, we would want to read the data into a feature-rich data structure like a pandas DataFrame, it would be very inefficient to create an empty DataFrame and directly write data to it. A DataFrame is a complex data structure, and writing something to a DataFrame item by item is computationally expensive. It’s a lot faster to read the data into a primitive data type like a list or a dict. Once the list or dict is created, pandas allows us to easily convert it to a DataFrame as you will see later on. The image below shows the standard process when it comes to parsing any file.
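A minimal sketch of that workflow, assuming pandas is installed: collect plain dicts in a list first, then build the DataFrame in a single call at the end. The variable names are illustrative, and the two rows are borrowed from the sample data used later in this article.

```python
import pandas as pd

# Collect rows as plain dicts first; appending to a list is cheap.
rows = []
for number, name in enumerate(["Phoebe", "Rachel"]):
    rows.append({"Student number": number, "Name": name})

# Convert once, at the end, instead of growing a DataFrame row by row.
df = pd.DataFrame(rows)
print(df)
```

The same pattern scales to any number of rows: the loop only touches a list, and pandas does its work once.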
If your data is in a standard format or close enough, then there is probably an existing package that you can use to read your data with minimal effort.
For example, let’s say we have a CSV file,
a, b, c
1, 2, 3
4, 5, 6
7, 8, 9
You can handle this easily with pandas.
import pandas as pd
df = pd.read_csv('')  # pass the path to your CSV file
df
a b c
0 1 2 3
1 4 5 6
2 7 8 9
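If pandas is not available, the standard library's csv module can read the same data. The sketch below is an alternative, not part of the workflow used in the rest of this article. It assumes the sample is held in an in-memory string rather than a file, and it uses skipinitialspace=True to drop the space after each comma.

```python
import csv
import io

# The sample CSV from above, held in memory instead of a file on disk.
sample = "a, b, c\n1, 2, 3\n4, 5, 6\n7, 8, 9\n"

# skipinitialspace=True ignores the space that follows each comma.
reader = csv.DictReader(io.StringIO(sample), skipinitialspace=True)
rows = [{key: int(value) for key, value in row.items()} for row in reader]

print(rows[0])
```

Each row comes back as a dict keyed by the header line, which is already the list-of-dicts shape discussed earlier.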
Python is incredible when it comes to dealing with strings. It is worth internalising all the common string operations. We can use these methods to extract data from a string as you can see in the simple example below.
my_string = 'Names: Romeo, Juliet'
# split the string at ':'
step_0 = my_string.split(':')
# get the second item of the list (index 1)
step_1 = step_0[1]
# split the string at ','
step_2 = step_1.split(',')
# strip leading and trailing edge spaces of each item of the list
step_3 = [name.strip() for name in step_2]
# do all the above operations in one go
one_go = [name.strip() for name in my_string.split(':')[1].split(',')]
for idx, item in enumerate([step_0, step_1, step_2, step_3]):
    print("Step {}: {}".format(idx, item))
print("Final result in one go: {}".format(one_go))
Step 0: ['Names', ' Romeo, Juliet']
Step 1:  Romeo, Juliet
Step 2: [' Romeo', ' Juliet']
Step 3: ['Romeo', 'Juliet']
Final result in one go: ['Romeo', 'Juliet']
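As a variation on the same idea, str.partition splits at the first occurrence of a separator and always returns three parts, which avoids an IndexError when the separator is missing. This is a sketch of an alternative, not the method used in the rest of the article, and the helper name parse_names is hypothetical.

```python
def parse_names(text):
    """Return the comma-separated names that follow the first ':' in text."""
    # partition always returns (before, separator, after); if ':' is
    # absent, separator and after are both empty strings.
    _, separator, after = text.partition(":")
    if not separator:
        return []
    return [name.strip() for name in after.split(",") if name.strip()]


print(parse_names("Names: Romeo, Juliet"))
print(parse_names("no names here"))
```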
As you saw in the previous two sections, if the parsing problem is simple we might get away with just using an existing parser or some string methods. However, life ain’t always that easy. How do we go about parsing a complex text file?
with open('') as file:  # pass the path to the sample text file
    file_contents = file.read()
    print(file_contents)
Sample text

A selection of students from Riverdale High and Hogwarts took part in a quiz.
Below is a record of their scores.

School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel

Student number, Score
0, 3
1, 7

Grade = 2
Student number, Name
0, Angela
1, Tristan
2, Aurora

Student number, Score
0, 6
1, 3
2, 9

School = Hogwarts
Grade = 1
Student number, Name
0, Ginny
1, Luna

Student number, Score
0, 8
1, 7

Grade = 2
Student number, Name
0, Harry
1, Hermione

Student number, Score
0, 5
1, 10

Grade = 3
Student number, Name
0, Fred
1, George

Student number, Score
0, 0
1, 0
That’s a pretty complex input file! Phew! The data it contains is pretty simple though as you can see below:
                                        Name  Score
School         Grade Student number
Hogwarts       1     0                 Ginny      8
                     1                  Luna      7
               2     0                 Harry      5
                     1              Hermione     10
               3     0                  Fred      0
                     1                George      0
Riverdale High 1     0                Phoebe      3
                     1                Rachel      7
               2     0                Angela      6
                     1               Tristan      3
                     2                Aurora      9
The sample text looks similar to a CSV in that it uses commas to separate out some information. There is a title and some metadata at the top of the file. There are five variables: School, Grade, Student number, Name and Score. School, Grade and Student number are keys. Name and Score are fields. For a given School, Grade, Student number there is a Name and a Score. In other words, School, Grade, and Student Number together form a compound key.
The data is given in a hierarchical format. First, a School is declared, then a Grade. This is followed by two tables providing Name and Score for each Student number. Then Grade is incremented. This is followed by another set of tables. Then the pattern repeats for another School. Note that the number of students in a Grade or the number of classes in a school are not constant, which adds a bit of complexity to the file. This is just a small dataset. You can easily imagine this being a massive file with lots of schools, grades and students.
It goes without saying that the data format is exceptionally poor. I have done this on purpose. If you understand how to handle this, then it will be a lot easier for you to master simpler formats. It’s not unusual to come across files like this if you have to deal with a lot of legacy systems. In the past when those systems were being designed, it may not have been a requirement for the data output to be machine readable. However, nowadays everything needs to be machine-readable!
We will need the Regular expressions module and the pandas package. So, let’s go ahead and import those.
import re
import pandas as pd
In the last step, we imported re, the regular expressions module. What is it though?
Well, earlier on we saw how to use the string methods to extract data from text. However, when parsing complex files, we can end up with a lot of stripping, splitting, slicing and whatnot and the code can end up looking pretty unreadable. That is where regular expressions come in. It is essentially a tiny language embedded inside Python that allows you to say what string pattern you are looking for. It is not unique to Python by the way (treehouse).
You do not need to become a master at regular expressions. However, some basic knowledge of regexes can be very handy in your programming career. I will only teach you the very basics in this article, but I encourage you to do some further study. I also recommend regexper for visualising regular expressions. regex101 is another excellent resource for testing your regular expression.
We are going to need three regexes. The first one, as shown below, will help us to identify the school. Its regular expression is School = (.*)\n. What do the symbols mean?
.: Any character except a newline
*: 0 or more of the preceding expression
(.*): Placing part of a regular expression inside parentheses allows you to group that part of the expression. So, in this case, the grouped part is the name of the school.
\n: The newline character at the end of the line
We then need a regular expression for the grade. Its regular expression is Grade = (\d+)\n. This is very similar to the previous expression. The new symbols are:
\d: Short for [0-9]
+: 1 or more of the preceding expression
Finally, we need a regular expression to identify whether the table that follows the expression in the text file is a table of names or scores. Its regular expression is (Name|Score). The new symbol is:
|: Logical or statement, so in this case, it means ‘Name’ or ‘Score’.
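To see the three patterns in action, the sketch below runs each one against a single line taken from the sample text. The variable names are illustrative.

```python
import re

# Each pattern is paired with one line from the sample text.
school = re.search(r"School = (.*)\n", "School = Riverdale High\n")
grade = re.search(r"Grade = (\d+)\n", "Grade = 1\n")
table = re.search(r"(Name|Score)", "Student number, Score\n")

print(school.group(1))  # the text captured by (.*)
print(grade.group(1))   # the digits captured by (\d+)
print(table.group(1))   # whichever of Name or Score was found
```

Note that the captured grade is still a string at this point; converting it to an integer is a separate step.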
We also need to understand a few regular expression functions:
re.compile(pattern): Compile a regular expression pattern into a RegexObject.
A RegexObject has the following methods:
match(string): If the beginning of string matches the regular expression, return a corresponding MatchObject instance. Otherwise, return None.
search(string): Scan through string looking for a location where this regular expression produced a match, and return a corresponding MatchObject instance. Return None if there are no matches.
A MatchObject always has a boolean value of True. Thus, we can just use an if statement to identify positive matches. It has the following method:
group(): Returns one or more subgroups of the match. Groups can be referred to by their index. group(0) returns the entire match. group(1) returns the first parenthesized subgroup and so on. The regular expressions we used only have a single group. Easy! However, what if there were multiple groups? It would get hard to remember which number a group belongs to. A Python specific extension allows us to name the groups and refer to them by their name instead. We can specify a name within a parenthesized group (...) like so: (?P<name>...).
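The difference between match() and search(), and the two ways of referring to a group, can be sketched as follows. The input line is made up for illustration and does not appear in the sample file.

```python
import re

pattern = re.compile(r"Grade = (?P<grade>\d+)")

# A made-up line in which the pattern does not start at position 0.
line = "School grade: Grade = 2"

# match() only succeeds if the pattern matches at the start of the string.
print(pattern.match(line))

# search() scans the whole string for the first location that matches.
found = pattern.search(line)
print(found.group(0))        # the entire match
print(found.group(1))        # the first group, by index
print(found.group("grade"))  # the same group, by name
```

Because a failed match returns None, code that calls group() should check the result first, for example with a plain if statement.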
Let us first define all the regular expressions. Be sure to use raw strings for regex, i.e., use the prefix r before each pattern.
# set up regular expressions
# use regexper to visualise these if required
rx_dict = {
    'school': re.compile(r'School = (?P<school>.*)\n'),
    'grade': re.compile(r'Grade = (?P<grade>\d+)\n'),
    'name_score': re.compile(r'(?P<name_score>Name|Score)'),
}
Then, we can define a function that checks for regex matches.
def _parse_line(line):
    """
    Do a regex search against all defined regexes and
    return the key and match result of the first matching regex
    """
    for key, rx in rx_dict.items():
        match = rx.search(line)
        if match:
            return key, match
    # if there are no matches
    return None, None
Finally, for the main event, we have the file parser function. It is quite big, but the comments in the code should hopefully help you understand the logic.
def parse_file(filepath):
    """
    Parse text at given filepath

    Parameters
    ----------
    filepath : str
        Filepath for file_object to be parsed

    Returns
    -------
    data : pd.DataFrame
        Parsed data
    """
    data = []  # create an empty list to collect the data
    # open the file and read through it line by line
    with open(filepath, 'r') as file_object:
        line = file_object.readline()
        while line:
            # at each line check for a match with a regex
            key, match = _parse_line(line)

            # extract school name
            if key == 'school':
                school = match.group('school')

            # extract grade
            if key == 'grade':
                grade = match.group('grade')
                grade = int(grade)

            # identify a table header
            if key == 'name_score':
                # extract type of table, i.e., Name or Score
                value_type = match.group('name_score')
                line = file_object.readline()
                # read each line of the table until a blank line
                while line.strip():
                    # extract number and value
                    number, value = line.strip().split(',')
                    value = value.strip()
                    # create a dictionary containing this row of data
                    row = {
                        'School': school,
                        'Grade': grade,
                        'Student number': number,
                        value_type: value
                    }
                    # append the dictionary to the data list
                    data.append(row)
                    line = file_object.readline()

            line = file_object.readline()

    # create a pandas DataFrame from the list of dicts
    data = pd.DataFrame(data)
    # set the School, Grade, and Student number as the index
    data.set_index(['School', 'Grade', 'Student number'], inplace=True)
    # consolidate df to remove nans
    data = data.groupby(level=data.index.names).first()
    # convert Score from text to a numeric type
    data['Score'] = pd.to_numeric(data['Score'])
    return data
We can use our parser on our sample text like so:
if __name__ == '__main__':
    filepath = ''  # pass the path to the sample text file
    data = parse_file(filepath)
    print(data)
This is all well and good, and you can see by comparing the input and output by eye that the parser is working correctly. However, the best practice is to always write unit tests to make sure your code is doing what you intended it to do. Whenever you write a parser, please ensure that it’s well tested. I have gotten into trouble with my colleagues for using parsers without testing before. Eeek! It’s also worth noting that this does not necessarily need to be the last step. Indeed, lots of programmers preach about Test Driven Development. I have not included a test suite here as I wanted to keep this tutorial concise.
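As a starting point, here is a minimal sketch of what such a test could look like for a line parser. It repeats a cut-down regex dictionary and line parser so that the block runs on its own. The class and test names are hypothetical, and a real suite would cover the file parser as well.

```python
import re
import unittest

# Cut-down copies of the regex dictionary and line parser, repeated here
# so that the example is self-contained.
rx_dict = {
    "school": re.compile(r"School = (?P<school>.*)\n"),
    "grade": re.compile(r"Grade = (?P<grade>\d+)\n"),
}


def _parse_line(line):
    for key, rx in rx_dict.items():
        match = rx.search(line)
        if match:
            return key, match
    return None, None


class ParseLineTest(unittest.TestCase):
    def test_school_line(self):
        key, match = _parse_line("School = Hogwarts\n")
        self.assertEqual(key, "school")
        self.assertEqual(match.group("school"), "Hogwarts")

    def test_grade_line(self):
        key, match = _parse_line("Grade = 3\n")
        self.assertEqual(key, "grade")
        self.assertEqual(match.group("grade"), "3")

    def test_unrecognised_line(self):
        self.assertEqual(_parse_line("0, Fred\n"), (None, None))


# Run the tests in-process and keep the result object for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseLineTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a project you would more commonly keep the tests in their own module and run them with python -m unittest.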
I have been parsing text files for a year and perfected my method over time. Even so, I did some additional research to find out if there was a better solution. Indeed, I owe thanks to various community members who advised me on optimising my code. The community also offered some different ways of parsing the text file. Some of them were clever and exciting. My personal favourite was this one. I presented my sample problem and solution at the forums below:
Reddit post
Stackoverflow post
Code review post
If your problem is even more complex and regular expressions don’t cut it, then the next step would be to consider parsing libraries. Here are a couple of places to start with:
Parsing Horrible Things with Python:
A PyCon lecture by Erik Rose looking at the pros and cons of various parsing libraries.
Parsing in Python: Tools and Libraries:
Tools and libraries that allow you to create parsers when regular expressions are not enough.
Now that you understand how difficult and annoying it can be to parse text files, if you ever find yourselves in the privileged position of choosing a file format, choose it with care. Here are Stanford’s best practices for file formats.
I’d be lying if I said I was delighted with my parsing method, but I’m not aware of another way of quickly parsing a text file that is as beginner friendly as what I’ve presented above. If you know of a better solution, I’m all ears! I have hopefully given you a good starting point for parsing a file in Python! I spent a couple of months trying lots of different methods and writing some insanely unreadable code before I finally figured it out and now I don’t think twice about parsing a file. So, I hope I have been able to save you some time. Have fun parsing text with Python!
How to Read a Text file In Python Effectively
Summary: in this tutorial, you learn various ways to read text files in Python.
TL;DR
The following shows how to read all lines of text from the file into a list of strings:
with open('') as f:  # pass the path to your text file
    lines = f.readlines()
Steps for reading a text file in Python
To read a text file in Python, you follow these steps:
First, open a text file for reading by using the open() function.
Second, read text from the text file using the read(), readline(), or readlines() method of the file object.
Third, close the file using the file close() method.
1) open() function
The open() function has many parameters but you’ll be focusing on the first two:
open(path_to_file, mode)
The path_to_file parameter specifies the path to the text file.
If the file is in the same folder as the program, you just need to specify the name of the file. Otherwise, you need to specify the path to the file.
To specify the path to the file, you use the forward-slash (‘/’) even if you’re working in Windows.
For example, if the file is stored in the sample folder, you need to specify the path to the file as c:/sample/ followed by the file name.
The mode is an optional parameter. It’s a string that specifies the mode in which you want to open the file. The following table shows available modes for opening a text file:
Mode    Description
'r'     Open a text file for reading text
'w'     Open a text file for writing text
'a'     Open a text file for appending text
For example, to open a file that is stored in the same folder as the program, you use the following code:
f = open('', 'r')
The open() function returns a file object which you will use to read text from a text file.
2) Reading text methods
The file object provides you with three methods for reading text from a text file:
read() – read all text from a file into a string. This method is useful if you have a small file and you want to manipulate the whole text of that file.
readline() – read the text file line by line, returning one line as a string on each call.
readlines() – read all the lines of the text file and return them as a list of strings.
3) close() method
The file that you open will remain open until you close it using the close() method.
It’s important to close the file that is no longer in use.
If you don’t close the file, the program may crash or the file would be corrupted.
The following shows how to call the close() method to close the file:
f.close()
To close the file automatically without calling the close() method, you use the with statement like this:
with open(path_to_file) as f:
    contents = f.readlines()
In practice, you’ll use the with statement to close the file automatically.
Reading a text file examples
We’ll use a sample text file for the demonstration.
The following example illustrates how to use the read() method to read all the contents of the file into a string:
with open('') as f:
    contents = f.read()
    print(contents)
Output:
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
...
The following example uses the readlines() method to read the text file and returns the file contents as a list of strings:
lines = []
with open('') as f:
    lines = f.readlines()

count = 0
for line in lines:
    count += 1
    print(f'line {count}: {line}')
Output:
line 1: Beautiful is better than ugly.
line 2: Explicit is better than implicit.
line 3: Simple is better than complex.
...
The following example shows how to use the readline() method to read the text file line by line:
with open('') as f:
    line = f.readline()
    while line:
        print(line)  # print before reading on, so the first line is not skipped
        line = f.readline()
Output:
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
...
A more concise way to read a text file line by line
The open() function returns a file object which is an iterable object. Therefore, you can use a for loop to iterate over the lines of a text file as follows:
with open('') as f:
    for line in f:
        print(line)
This is a more concise way to read a text file line by line.
Read UTF-8 text files
The code in the previous examples works fine with ASCII text files. However, if you’re dealing with other languages such as Japanese, Chinese, and Korean, the text file is not a simple ASCII text file. And it’s likely a UTF-8 file that uses more than just the standard ASCII text characters.
To open a UTF-8 text file, you need to pass encoding='utf-8' to the open() function to instruct it to expect UTF-8 characters from the file.
For the demonstration, you’ll use a file that contains some non-ASCII quotes. The following shows how to loop through the file:
with open('', encoding='utf8') as f:
    for line in f:
        print(line.strip())
Summary
Use the open() function with the 'r' mode to open a text file for reading.
Use the read(), readline(), or readlines() method to read a text file.
Always close a file after completing reading it using the close() method or the with statement.
Use encoding='utf-8' to read the UTF-8 text file.
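The three reading methods summarised above can be compared side by side. The sketch below writes a small temporary file first so that it runs on its own. The file name and contents are made up for illustration.

```python
import os
import tempfile

# Write a small UTF-8 file to a temporary folder so the example is
# self-contained; in practice you would open an existing file.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

with open(path, encoding="utf-8") as f:
    whole_text = f.read()        # one string containing the whole file

with open(path, encoding="utf-8") as f:
    first_line = f.readline()    # a single line, newline included

with open(path, encoding="utf-8") as f:
    all_lines = f.readlines()    # a list with one string per line

print(repr(whole_text))
print(repr(first_line))
print(all_lines)
```

Each with block closes the file when it ends, so close() is never called explicitly.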
How to Extract Specific Portions of a Text File Using Python
Updated: 06/30/2020
Extracting text from a file is a common task in scripting and programming, and Python makes it easy. In this guide, we’ll discuss some simple ways to extract text from a file using the Python 3 programming language.
Make sure you’re using Python 3
Reading data from a text file
Using “with open”
Reading text files line-by-line
Storing text data in a variable
Searching text for a substring
Incorporating regular expressions
Putting it all together
In this guide, we’ll be using Python version 3. Most systems come pre-installed with Python 2.7. While Python 2.7 is used in legacy code, Python 3 is the present and future of the Python language. Unless you have a specific reason to write or support Python 2, we recommend working in Python 3.
For Microsoft Windows, Python 3 can be downloaded from the Python official website. When installing, make sure the “Install launcher for all users” and “Add Python to PATH” options are both checked, as shown in the image below.
On Linux, you can install Python 3 with your package manager. For instance, on Debian or Ubuntu, you can install it with the following command:
sudo apt-get update && sudo apt-get install python3
For macOS, the Python 3 installer can be downloaded from the official Python website, as linked above. If you are using the Homebrew package manager, it can also be installed by opening a terminal window (Applications → Utilities), and running this command:
brew install python3
Running Python
On Linux and macOS, the command to run the Python 3 interpreter is python3. On Windows, if you installed the launcher, the command is py. The commands on this page use python3; if you’re on Windows, substitute py for python3 in all commands.
Running Python with no options starts the interactive interpreter. For more information about using the interpreter, see Python overview: using the Python interpreter. If you accidentally enter the interpreter, you can exit it using the command exit() or quit().
Running Python with a file name will interpret that Python program. For instance, python3 followed by the name of a program file runs the program contained in that file.
Okay, how can we use Python to extract text from a text file?
First, let’s read a text file. Let’s say we’re working with a text file that contains lines from the Lorem Ipsum example text.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
Note: In all the examples that follow, we work with the four lines of text contained in this file. Copy and paste the Latin text above into a text file and save it, so you can run the example code using this file as input.
A Python program can read a text file using the built-in open() function. For example, the Python 3 program below opens the file for reading in text mode, reads the contents into a string variable named contents, closes the file, and prints the data.
myfile = open("", "rt")    # open the text file for reading text
contents = myfile.read()   # read the entire file to string
myfile.close()             # close the file
print(contents)            # print string contents
Here, myfile is the name we give to our file object.
The “rt” parameter in the open() function means “we’re opening this file to read text data.”
The hash mark (“#”) means that everything on that line is a comment, and it’s ignored by the Python interpreter.
If you save this program in a file, you can run it by passing the file name to the python3 command. The command outputs the contents of the text file.
It’s important to close your open files as soon as possible: open the file, perform your operation, and close it. Don’t leave it open for extended periods of time.
When you’re working with files, it’s good practice to use the with compound statement. It’s the cleanest way to open a file, operate on it, and close the file, all in one easy-to-read block of code. The file is automatically closed when the code block completes.
Using with, we can rewrite our program to look like this:
with open('', 'rt') as myfile:    # Open the text file for reading text
    contents = myfile.read()      # Read the entire file to a string
print(contents)                   # Print the string
Note: Indentation is important in Python. Python programs use white space at the beginning of a line to define scope, such as a block of code. We recommend you use four spaces per level of indentation, and that you use spaces rather than tabs. In the following examples, make sure your code is indented exactly as it’s presented here.
Example: save the program to a file and execute it with python3. The output is the four lines of the text file.
In the examples so far, we’ve been reading in the whole file at once. Reading a full file is no big deal with small files, but generally speaking, it’s not a great idea. For one thing, if your file is bigger than the amount of available memory, you’ll encounter an error.
In almost every case, it’s a better idea to read a text file one line at a time.
In Python, the file object is an iterator. An iterator is a type of Python object which behaves in certain ways when operated on repeatedly. For instance, you can use a for loop to operate on a file object repeatedly, and each time the same operation is performed, you’ll receive a different, or “next,” result.
For text files, the file object iterates one line of text at a time. It considers one line of text a “unit” of data, so we can use a loop statement to iterate one line at a time:
with open('', 'rt') as myfile:  # Open the text file for reading
    for myline in myfile:       # For each line, read to a string,
        print(myline)           # and print the string.
Notice that we’re getting an extra line break (“newline”) after every line. That’s because two newlines are being printed. The first one is the newline at the end of every line of our text file. The second newline happens because, by default, print() adds a linebreak of its own at the end of whatever you’ve asked it to print.
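To make the iterator behaviour described above concrete, the sketch below calls next() by hand, which is what a for loop does behind the scenes. It uses io.StringIO as a stand-in for an open text file, and the two lines of text are made up for illustration.

```python
import io

# io.StringIO behaves like a text file opened for reading.
myfile = io.StringIO("line one\nline two\n")

first = next(myfile)      # the first call returns the first line
second = next(myfile)     # the second call returns the next line
remaining = list(myfile)  # nothing is left, so this is an empty list

print(repr(first), repr(second), remaining)
```

Once the iterator is exhausted it stays exhausted, which is why a file can only be looped over once without reopening it or seeking back to the start.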
Let’s store our lines of text in a variable — specifically, a list variable — so we can look at it more closely.
In Python, lists are similar to, but not the same as, an array in C or Java. A Python list contains indexed data, of varying lengths and types.
mylines = []                        # Declare an empty list named mylines.
with open('', 'rt') as myfile:      # Open the text file for reading text data.
    for myline in myfile:           # For each line, stored as myline,
        mylines.append(myline)      # add its contents to mylines.
print(mylines)                      # Print the list.
The output of this program is a little different. Instead of printing the contents of the list, this program prints our list object, which looks like this:
['Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n', 'Nunc fringilla arcu congue metus aliquam mollis.\n', 'Mauris nec maximus purus. Maecenas sit amet pretium tellus.\n', 'Quisque at dignissim lacus.\n']
Here, we see the raw contents of the list. In its raw object form, a list is represented as a comma-delimited list. Here, each element is represented as a string, and each newline is represented as its escape character sequence, \n.
Much like a C or Java array, the list elements are accessed by specifying an index number after the variable name, in brackets. Index numbers start at zero; in other words, the nth element of a list has the numeric index n-1.
Note: If you’re wondering why the index numbers start at zero instead of one, you’re not alone. Computer scientists have debated the usefulness of zero-based numbering systems in the past. In 1982, Edsger Dijkstra gave his opinion on the subject, explaining why zero-based numbering is the best way to index data in computer science. You can read the memo yourself; he makes a compelling argument.
We can print the first element of mylines by specifying index number 0, contained in brackets after the name of the list:
print(mylines[0])
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Or the third line, by specifying index number 2:
print(mylines[2])
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
But if we try to access an index for which there is no value, we get an error. Our list has four elements, so the valid index numbers are 0 through 3:
print(mylines[4])
Traceback (most recent call last):
File
IndexError: list index out of range
A list object is iterable, so to print every element of the list, we can iterate over it with a for loop:
mylines = []                        # Declare an empty list
with open('', 'rt') as myfile:      # Open the text file for reading text.
    for line in myfile:             # For each line of text,
        mylines.append(line)        # add that line to the list.
for element in mylines:             # For each element in the list,
    print(element)                  # print it.
But we’re still getting extra newlines. Each line of our text file ends in a newline character (‘\n’), which is being printed. Also, after printing each line, print() adds a newline of its own, unless you tell it to do otherwise.
We can change this default behavior by specifying an end parameter in our print() call:
print(element, end='')
By setting end to an empty string (two single quotes, with no space), we tell print() to print nothing at the end of a line, instead of a newline character.
Our revised program looks like this:
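mylines = []                        # Declare an empty list
with open('', 'rt') as myfile:      # Open file
    for line in myfile:             # For each line of text,
        mylines.append(line)        # add that line to the list.
for element in mylines:             # For each element in the list,
    print(element, end='')          # print it without extra newlines.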
The newlines you see here are actually in the file; they’re a special character (‘\n’) at the end of each line. We want to get rid of these, so we don’t have to worry about them while we process the file.
How to strip newlines
To remove the newlines completely, we can strip them. To strip a string is to remove one or more characters, usually whitespace, from either the beginning or end of the string.
Tip: This process is sometimes also called “trimming.”
Python 3 string objects have a method called rstrip(), which strips characters from the right side of a string. The English language reads left-to-right, so stripping from the right side removes characters from the end.
If the variable is named mystring, we can strip its right side with mystring.rstrip(chars), where chars is a string of characters to strip. For example, "123abc".rstrip("bc") returns 123a.
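One detail is easy to miss: the chars argument is treated as a set of characters, not as a suffix. A short sketch, using made-up strings, shows the difference.

```python
# rstrip(chars) removes any trailing characters found in chars, in any
# order, and stops at the first character that is not in the set.
ordered = "123abc".rstrip("bc")
reordered = "123abc".rstrip("cb")   # the order of chars does not matter
no_newline = "line of text\n".rstrip("\n")

# With no argument, rstrip() removes all trailing whitespace.
no_whitespace = "padded \t\n".rstrip()

print(ordered, reordered, repr(no_newline), repr(no_whitespace))
```

Passing '\n' explicitly, rather than calling rstrip() with no argument, keeps any trailing spaces or tabs that are part of the line.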
Tip: When you represent a string in your program with its literal contents, it’s called a string literal. In Python (as in most programming languages), string literals are always quoted, that is, enclosed on either side by single (') or double (") quotes. In Python, single and double quotes are equivalent; you can use one or the other, as long as they match on both ends of the string. It’s traditional to represent a human-readable string (such as Hello) in double-quotes ("Hello"). If you’re representing a single character (such as b), or a single special character such as the newline character (\n), it’s traditional to use single quotes ('b', '\n'). For more information about how to use strings in Python, you can read the documentation of strings in Python.
The statement string.rstrip('\n') will strip a newline character from the right side of string. The following version of our program strips the newlines when each line is read from the text file:
mylines = []                                    # Declare an empty list.
with open('', 'rt') as myfile:                  # Open the text file for reading text.
    for myline in myfile:                       # For each line in the file,
        mylines.append(myline.rstrip('\n'))     # strip newline and add to list.
The text is now stored in a list variable, so individual lines can be accessed by index number. Newlines were stripped, so we don’t have to worry about them. We can always put them back later if we reconstruct the file and write it to disk.
Now, let’s search the lines in the list for a specific substring.
Let’s say we want to locate every occurrence of a certain phrase, or even a single letter. For instance, maybe we need to know where every “e” is. We can accomplish this using the string’s find() method.
The list stores each line of our text as a string object. All string objects have a method, find(), which locates the first occurrence of a substring in the string.
Let’s use the find() method to search for the letter “e” in the first line of our text file, which is stored in the list mylines. The first element of mylines is a string object containing the first line of the text file. This string object has a find() method.
In the parentheses of find(), we specify parameters. The first and only required parameter is the string to search for, "e". The statement mylines[0].find("e") tells the interpreter to search forward, starting at the beginning of the string, one character at a time, until it finds the letter “e.” When it finds one, it stops searching, and returns the index number where that “e” is located. If it reaches the end of the string without finding one, it returns -1 to indicate nothing was found.
print(mylines[0].find("e"))
3
The return value “3” tells us that the letter “e” is the fourth character, the “e” in “Lorem”. (Remember, the index is zero-based: index 0 is the first character, 1 is the second, etc.)
The find() method takes two optional, additional parameters: a start index and a stop index, indicating where in the string the search should begin and end. For instance, mystring.find("abc", 10, 20) searches for the substring “abc”, but only from index 10 up to, but not including, index 20 (the 11th through 20th characters). If stop is not specified, find() starts at index start, and stops at the end of the string.
For instance, the following statement searches for “e” in mylines[0], beginning at the fifth character.
print(mylines[0].find("e", 4))
24
In other words, starting at the 5th character in mylines[0], the first “e” is located at index 24 (the “e” in “nec”).
To start searching at index 10, and stop at index 30:
print(mylines[1].find("e", 10, 30))
28
(The first “e” in “Maecenas”).
If find() doesn’t locate the substring in the search range, it returns the number -1, indicating failure:
print(mylines[0].find("e", 25, 30))
-1
There were no “e” occurrences between indices 25 and 30.
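To make the start and stop parameters concrete without depending on the sample file, here is a small sketch on a made-up sentence (the string and variable names are assumptions for illustration only). Note in particular that the stop index itself is not searched:

```python
text = "The quick brown fox jumps over the lazy dog"

first = text.find("e")              # search the whole string
after_3 = text.find("e", 3)         # start at index 3, run to the end
excluded = text.find("e", 3, 28)    # indices 3..27 only: index 28 is left out
included = text.find("e", 3, 29)    # indices 3..28: now index 28 is reached
word = text.find("fox")             # substrings longer than one character work too

print(first, after_3, excluded, included, word)   # 2 28 -1 28 16
```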
Finding all occurrences of a substring
But what if we want to locate every occurrence of a substring, not just the first one we encounter? We can iterate over the string, starting from the index of the previous match.
In this example, we’ll use a while loop to repeatedly find the letter “e”. When an occurrence is found, we call find() again, starting from a new location in the string: specifically, the location of the last occurrence, plus the length of the substring (so we can move forward past the last one). When find() returns -1, or the start index exceeds the length of the string, we stop.
# Build array of lines from file, strip newlines (mylines is filled as shown earlier)
# Locate and print all occurrences of letter "e"
substr = "e"                      # substring to search for.
for line in mylines:              # string to be searched
    index = 0                     # current index: character being compared
    prev = 0                      # previous index: last character compared
    while index < len(line):      # While index has not exceeded string length,
        index = line.find(substr, index)   # set index to next occurrence of "e"
        if index == -1:           # If nothing was found,
            break                 # exit the while loop.
        print(" " * (index - prev) + substr, end='')   # print spaces from previous
                                                       # match, then the substring.
        prev = index + len(substr)    # remember this position for next loop.
        index += len(substr)      # increment the index by the length of substr.
                                  # (Repeat until index > line length)
    print('\n' + line)            # Print the original string under the e's
e e e e e
e e
e e e e e e
e
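If you want the positions themselves rather than a printed picture, the same loop can be wrapped in a small helper. The function name find_all below is hypothetical (it is not a built-in string method), and the sketch only counts non-overlapping occurrences:

```python
def find_all(text, substr):
    """Return the start index of every non-overlapping occurrence of substr."""
    if not substr:
        # An empty substring "matches" everywhere and would loop forever.
        raise ValueError("substr must not be empty")
    positions = []
    index = text.find(substr)
    while index != -1:                 # find() returns -1 when nothing is left
        positions.append(index)
        index = text.find(substr, index + len(substr))   # skip past this match
    return positions


print(find_all("The quick brown fox jumps over the lazy dog", "e"))   # [2, 28, 33]
print(find_all("aaaa", "aa"))                                         # [0, 2]
print(find_all("abc", "z"))                                           # []
```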
For complex searches, use regular expressions.
The Python regular expressions module is called re. To use it in your program, import the module before you use it:
import re
The re module implements regular expressions by compiling a search pattern into a pattern object. Methods of this object can then be used to perform match operations.
For example, let’s say you want to search for any word in your document which starts with the letter d and ends in the letter r. We can accomplish this using the regular expression “\bd\w*r\b”. What does this mean?
Character sequence: meaning
\b: A word boundary. It matches the empty string, but only at a position where a word character sits next to a non-word character or next to the start or end of the string. “Word characters” are the digits 0 through 9, the lowercase and uppercase letters, and the underscore (“_”).
d: Lowercase letter d.
\w*: \w represents any word character, and * is a quantifier meaning “zero or more of the previous character.” So \w* will match zero or more word characters.
r: Lowercase letter r.
\b: Word boundary.
So this regular expression will match any string that can be described as “a word boundary, then a lowercase ‘d’, then zero or more word characters, then a lowercase ‘r’, then a word boundary.” Strings described this way include the words destroyer, dour, and doctor, and the abbreviation dr.
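A quick way to check that reading of the pattern is to run it over a sentence and list what it picks out. The sketch below uses re.findall() from the standard re module on an invented sample sentence:

```python
import re

sample = "dr. doctor, a dour destroyer of doors and Dr. Dolittle"

# findall() returns every non-overlapping match as a list of strings.
words = re.findall(r"\bd\w*r\b", sample)
print(words)   # ['dr', 'doctor', 'dour', 'destroyer']
```

Here “doors” is skipped because the r is followed by another word character, “and” is skipped because its d is not at a word boundary, and “Dr” is skipped because the pattern asks for a lowercase d.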
To use this regular expression in Python search operations, we first compile it into a pattern object. For instance, the following Python statement creates a pattern object named pattern which we can use to perform searches using that regular expression.
pattern = re.compile(r"\bd\w*r\b")
Note: The letter r before our string in the statement above is important. It tells Python to interpret our string as a raw string, exactly as we’ve typed it. If we didn’t prefix the string with an r, Python would interpret escape sequences such as \b in other ways (in an ordinary string literal, \b is a backspace character). Whenever you need Python to interpret your strings literally, specify it as a raw string by prefixing it with r.
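The difference is easy to observe directly. The following sketch compares the two spellings of \b and shows what happens to a search when the r prefix is left off (the sample sentence is made up):

```python
import re

plain_len = len("\b")     # 1: a single backspace character
raw_len = len(r"\b")      # 2: a backslash followed by the letter b
is_backspace = ("\b" == "\x08")

# With the raw string, \b reaches the regex engine as a word boundary.
with_raw = re.search(r"\bdr\b", "call dr now")
# Without it, the pattern contains literal backspace characters instead.
without_raw = re.search("\bdr\b", "call dr now")

print(plain_len, raw_len, is_backspace)            # 1 2 True
print(with_raw is not None, without_raw is None)   # True True
```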
Now we can use the pattern object’s methods, such as search(), to search a string for the compiled regular expression, looking for a match. If it finds one, it returns a special result called a match object. Otherwise, it returns None, a built-in Python constant that is used like the boolean value “false”.
import re
text = "Good morning, doctor."
pat = re.compile(r"\bd\w*r\b")   # compile regex "\bd\w*r\b" to a pattern object
if pat.search(text) is not None: # Search for the pattern. If found,
    print("Found it.")
Found it.
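The match object carries more than a yes-or-no answer. As a small sketch, its group(), start(), and span() methods report what was matched and where:

```python
import re

text = "Good morning, doctor."
pat = re.compile(r"\bd\w*r\b")

match = pat.search(text)
if match is not None:
    print(match.group())   # doctor   (the text that matched)
    print(match.start())   # 14       (index where the match begins)
    print(match.span())    # (14, 20) (start index, end index)

no_match = pat.search("Good morning.")
print(no_match)            # None
```

The “d” in “Good” is not reported because it sits in the middle of a word, so there is no word boundary in front of it.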
To perform a case-insensitive search, you can specify the special constant re.IGNORECASE in the compile step:
text = "Hello, DoctoR."
pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)  # upper and lowercase will match
if pat.search(text) is not None:
    print("Found it.")
So now we know how to open a file, read the lines into a list, and locate a substring in any given list element. Let’s use this knowledge to build some example programs.
Print all lines containing substring
The program below reads a log file line by line. If the line contains the word “error,” it is added to a list called errors. If not, it is ignored. The lower() string method converts all strings to lowercase for comparison purposes, making the search case-insensitive without altering the original strings.
Note that the find() method is called directly on the result of the lower() method; this is called method chaining. Also, note that in the print() statement, we construct an output string by joining several strings with the + operator.
errors = []                       # The list where we will store results.
linenum = 0
substr = "error".lower()          # Substring to search for.
with open('', 'rt') as myfile:    # Put the path to your log file between the quotes.
    for line in myfile:
        linenum += 1
        if line.lower().find(substr) != -1:    # if case-insensitive match,
            errors.append("Line " + str(linenum) + ": " + line.rstrip('\n'))
for err in errors:
    print(err)
Input (stored in):
This is line 1
This is line 2
Line 3 has an error!
This is line 4
Line 5 also has an error!
Line 3: Line 3 has an error!
Line 5: Line 5 also has an error!
Extract all lines containing substring, using regex
The program below is similar to the above program, but uses the re regular expressions module. The errors and line numbers are stored as tuples, e.g., (linenum, line). The tuple is created by the additional enclosing parentheses in the errors.append() statement. The elements of the tuple are referenced similarly to a list, with a zero-based index in brackets. As constructed here, err[0] is a linenum and err[1] is the associated line containing an error.
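import re
errors = []
linenum = 0
pattern = re.compile("error", re.IGNORECASE)   # Compile a case-insensitive regex
with open('', 'rt') as myfile:                 # Put the path to your log file between the quotes.
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:   # If a match is found
            errors.append((linenum, line.rstrip('\n')))
for err in errors:                             # Iterate over the list of tuples
    print("Line " + str(err[0]) + ": " + err[1])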
Line 6: Mar 28 09:10:37 Error: cannot contact server. Connection refused.
Line 10: Mar 28 10:28:15 Kernel error: The specified location is not mounted.
Line 14: Mar 28 11:06:30 ERROR: usb 1-1: can’t set config, exiting.
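The same tuple-building logic can be tried without a log file by passing in a list of lines. In the sketch below, the helper name find_matching_lines and the in-memory sample lines are assumptions for illustration; enumerate() supplies the line numbers, and the for loop unpacks each tuple instead of indexing it:

```python
import re


def find_matching_lines(lines, pattern):
    """Return a (linenum, line) tuple for every line the pattern matches."""
    matches = []
    for linenum, line in enumerate(lines, start=1):   # count lines from 1
        if pattern.search(line) is not None:
            matches.append((linenum, line.rstrip('\n')))
    return matches


# In-memory stand-in for the lines of a log file.
log_lines = [
    "This is line 1\n",
    "This is line 2\n",
    "Line 3 has an error!\n",
    "This is line 4\n",
    "Line 5 also has an ERROR!\n",
]

errors = find_matching_lines(log_lines, re.compile("error", re.IGNORECASE))
for linenum, line in errors:          # unpack each tuple into two names
    print("Line " + str(linenum) + ": " + line)
```

Because an open file is also an iterable of lines, the same function should work when given a file object in place of the list.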
Extract all lines containing a phone number
The program below prints any line of a text file which contains a US or international phone number. It accomplishes this with the regular expression "(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}". This regex matches the following phone number notations:
123-456-7890
(123) 456-7890
123 456 7890
123.456.7890
+91 (123) 456-7890
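import re
pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")
with open('', 'rt') as myfile:                    # Put the path to your text file between the quotes.
    for linenum, line in enumerate(myfile, 1):    # Number the lines starting at 1.
        if pattern.search(line) is not None:      # If pattern search finds a match,
            print("Line " + str(linenum) + ": " + line.rstrip('\n'))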
Line 3: My phone number is 731.215.8881.
Line 7: You can reach Mr. Walters at (212) 558-3131.
Line 12: His agent, Mrs. Kennedy, can be reached at +12 (123) 456-7890
Line 14: She can also be contacted at (888) 312.8403, extension 12.
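It can be instructive to run the pattern against each notation on its own and look at what search() actually matched. The sketch below does that for the five notations listed above; the extra test strings at the end are invented:

```python
import re

pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")

notations = [
    "123-456-7890",
    "(123) 456-7890",
    "123 456 7890",
    "123.456.7890",
    "+91 (123) 456-7890",
]

# Collect the matched text for each notation (None if there is no match).
matched = {}
for notation in notations:
    match = pattern.search(notation)
    matched[notation] = match.group() if match else None
    print(repr(notation), "->", repr(matched[notation]))

seven_digits = pattern.search("order 1234567")    # plain digits also match
no_number = pattern.search("call me at noon")     # no digits at all
```

Every notation produces a hit, but the matched text is only the final seven digits and their separators; the area code and country code are not checked. That is enough for flagging lines that probably contain a phone number, though a bare run of seven digits will be flagged as well.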
Search a dictionary for words
The program below searches the dictionary for any words that start with h and end in pe. For input, it uses a dictionary file included on many Unix systems, /usr/share/dict/words.
import re
filename = "/usr/share/dict/words"
pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)
with open(filename, "rt") as myfile:
    for line in myfile:                        # Each line holds one word.
        if pattern.search(line) is not None:
            print(line, end='')
Hope
heliotrope
hope
hornpipe
horoscope
hype
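Not every system ships /usr/share/dict/words, so here is a sketch that applies the same pattern to a short hand-picked word list (an assumption for illustration). Each entry keeps its trailing newline, as a line read from a file would:

```python
import re

pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)

# Stand-in for lines read from a dictionary file.
words = ["Hope\n", "hopeful\n", "shape\n", "hype\n", "help\n", "heliotrope\n"]

found = [word.rstrip('\n') for word in words if pattern.search(word) is not None]
print(found)   # ['Hope', 'hype', 'heliotrope']
```

The $ anchor still matches here because it accepts a position just before a newline at the end of the string. “shape” is rejected because its h is not at a word boundary, and “hopeful” and “help” are rejected because they do not end in pe.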
Frequently Asked Questions about python parse text
How do you parse text in Python?
To read a text file in Python, you follow these steps: First, open a text file for reading by using the open() function. Second, read text from the text file using the read(), readline(), or readlines() method of the file object. Third, close the file using the close() method of the file object.
What is text parsing in Python?
Text parsing is a common programming task that splits the given sequence of characters or values (text) into smaller parts based on some rules. It has been used in a wide variety of applications ranging from simple file parsing to large scale natural language processing.