A recent update of tabula-py

Photo by [Joshua Rawson-Harris](https://unsplash.com/@joshrh19?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&utm_medium=referral)

This article is a repost of a Patreon article published last December. I’m planning to release the next version of tabula-py within a few weeks.

(Note: Oct 7th, 2019) As of October 2019, I have launched a documentation site and a Google Colab notebook for tabula-py. The FAQ is a good place to look for tips on accurate extraction.

This is my first post on Patreon. Apologies for the delayed announcement of the recent tabula-py updates. In this post I will introduce their key features.

Use Tabula app template

The Tabula app can export a template so that the same bounding boxes can be reused for extraction. tabula-py can now load a Tabula app template and extract tables with it.

dfs = tabula.read_pdf_with_template(
  './examples/data.pdf',
  './examples/data.tabula-template.json',
  pandas_options={'header': 0})
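To give a feel for what a template carries, here is a minimal sketch that turns template-style JSON into per-region options. The field names (`page`, `extraction_method`, `x1`, `y1`, `x2`, `y2`) are my assumption about what an exported template contains, and `template_to_options` is a hypothetical helper, not part of tabula-py; check your own exported file before relying on it.

```python
import json


def template_to_options(template_text):
    """Convert template-style JSON text into one option dict per region.

    Assumes each region has page, x1, y1, x2, y2 and optionally
    extraction_method (an assumption about the template layout).
    """
    regions = json.loads(template_text)
    options = []
    for region in regions:
        options.append({
            "pages": region["page"],
            # tabula takes areas in top, left, bottom, right order
            "area": [region["y1"], region["x1"], region["y2"], region["x2"]],
            "method": region.get("extraction_method", "guess"),
        })
    return options


# Hypothetical one-region template, written inline for illustration
sample = (
    '[{"page": 1, "extraction_method": "stream", '
    '"x1": 10.0, "y1": 20.0, "x2": 300.0, "y2": 400.0}]'
)
options = template_to_options(sample)
print(options)
```

Each resulting dict holds a page number and an area, which is the same information you would otherwise pass by hand for every region.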

Support file-like object

Like many Python libraries, tabula-py can now extract from a file-like object. It also accepts a pathlib.Path.

# With file-like object
pdf_path = 'tests/resources/data.pdf'
with open(pdf_path, 'rb') as f:
  df = tabula.read_pdf(f)

# With pathlib  
from pathlib import Path  
pdf_path = 'tests/resources/data.pdf'
df = tabula.read_pdf(Path(pdf_path))
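If you wrap a tool that only understands file paths, one common way to accept all three input types is to pass paths through and spool file-like objects into a temporary file. The sketch below shows that pattern with the standard library only. `localize` is a hypothetical helper for illustration, and I am not claiming this is exactly how tabula-py does it internally.

```python
import io
import os
import shutil
import tempfile
from pathlib import Path


def localize(source, suffix=".pdf"):
    """Return (path, is_temporary) for a str, Path, or binary file-like object."""
    if isinstance(source, (str, os.PathLike)):
        # Already a path: hand it back unchanged, nothing to clean up
        return os.fspath(source), False
    if hasattr(source, "read"):
        # File-like object: copy its bytes into a named temporary file
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            shutil.copyfileobj(source, tmp)
        return tmp.name, True
    raise TypeError("expected a path or a binary file-like object")


payload = b"%PDF-1.4 placeholder bytes"
path, is_temp = localize(io.BytesIO(payload))
try:
    size = os.path.getsize(path)
finally:
    # The caller owns the temporary copy and removes it when done
    if is_temp:
        os.remove(path)

plain_path, plain_is_temp = localize(Path("tests") / "resources" / "data.pdf")
```

Returning the `is_temporary` flag keeps cleanup explicit, so a user-supplied file is never deleted by mistake.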

Allow multiple area option

As of tabula-java v1.0.2, tabula can handle multiple areas, so the area option accepts a list of areas.

pdf_path = 'tests/resources/MultiColumn.pdf'
# Relative area  
df_relative = tabula.read_pdf(  
  pdf_path, pages=1,
  area=[[0, 0, 100, 50], [0, 50, 100, 100]], relative_area=True)  

# Absolute area
df_absolute = tabula.read_pdf(
  pdf_path, pages=1, area=[[0, 0, 451, 212], [0, 212, 451, 425]])
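The relative and absolute forms describe the same regions: relative values are percentages of the page, and each area is given in top, left, bottom, right order. Here is a small sketch of the conversion. The page size of 451 x 425 points is inferred from the numbers in the example above, not measured from the PDF, and `relative_to_absolute` is a hypothetical helper.

```python
def relative_to_absolute(area, page_height, page_width):
    """Convert a [top, left, bottom, right] area from percent to points."""
    top, left, bottom, right = area
    return [
        top / 100 * page_height,
        left / 100 * page_width,
        bottom / 100 * page_height,
        right / 100 * page_width,
    ]


# Assumed page size in points, inferred from the absolute areas above
PAGE_HEIGHT, PAGE_WIDTH = 451, 425

left_column = relative_to_absolute([0, 0, 100, 50], PAGE_HEIGHT, PAGE_WIDTH)
right_column = relative_to_absolute([0, 50, 100, 100], PAGE_HEIGHT, PAGE_WIDTH)
print(left_column)   # [0.0, 0.0, 451.0, 212.5]
print(right_column)  # [0.0, 212.5, 451.0, 425.0]
```

The absolute example uses 212 where this arithmetic gives 212.5, so treat the two forms as approximately rather than exactly equal.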

Tip: Get table position

This is not a new feature, but I think it might be helpful for some PDFs.
Detailed post: https://github.com/chezou/tabula-py/issues/102

With output_format="json", the result of read_pdf contains position info, so you can get the table position as follows:

In [5]: tables = read_pdf("./examples/data.pdf", output_format="json", pages=2)
In [9]: top = tables[0]['top']  
In [10]: left = tables[0]['left']
In [11]: bottom = tables[0]['height'] + top  
In [12]: right = tables[0]['width'] + left  
In [13]: top, bottom, left, right  
Out[13]: (0.0, 528.8800048828125, 0.0, 564.8800048828125)
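Once you have these values, you can pack them into the top, left, bottom, right order that the area option uses. Below is a minimal sketch using the same keys as the session above. `table_area` is a hypothetical helper, and the sample dict only mimics the relevant part of one JSON entry.

```python
def table_area(table):
    """Build a [top, left, bottom, right] list from one JSON table entry."""
    top = table["top"]
    left = table["left"]
    bottom = top + table["height"]
    right = left + table["width"]
    return [top, left, bottom, right]


# Sample entry with only the position keys, values taken from the session above
entry = {
    "top": 0.0,
    "left": 0.0,
    "height": 528.8800048828125,
    "width": 564.8800048828125,
}
area = table_area(entry)
print(area)
```

The resulting list can then be passed as area in a later read_pdf call to restrict extraction to that table.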

If you have any questions, ask on Stack Overflow!

Aki Ariga
Staff Software Engineer

Interested in Machine Learning, ML Ops, and Data driven business. If you like my blog post, I’m glad if you can buy me a tea 😉
