Splitting and Merging Multiple Columns #194

tlyon3 · 2021-03-25T18:39:13Z


Running into a problem. This came up specifically with the South Carolina scrapers.

Scraping the PDFs, sometimes a dataframe comes out where a portion of the rows' data is combined in the first column. This seems to only happen for the first three columns.

In [46]: problem_df.iloc[:, [0,1,2]]
Out[46]:
                                                    0             1           2
0
1                             Providers\nCity\nCounty
2   Abbeville Area Medical Center Group\nAbbeville...
3   Abbeville Area Healthcare Center\nAbbeville \n...
4       Due West Family Medicine\nDue West\nAbbeville
..                                                ...           ...         ...
60                           Doctors Care - Northeast      Columbia    Richland
61                          Doctors Care - Orangeburg    Orangeburg  Orangeburg
62                           Doctors Care - Rock Hill     Rock Hill        York
63                              Doctors Care - Strand  Myrtle Beach       Horry
64                         Doctors Care - Summerville   Summerville  Dorchester

I need a way to recognize when this happens, split only the affected rows, and move the split data into the respective columns. I know I can use df[0].str.split("\n", expand=True) to split the rows and create new columns, but I need those merged into the respective columns that already exist in the original df.

Answered by ghop02

Mar 26, 2021


sglyon · 2021-03-26T14:16:41Z


I think you are on the right track with the .split("\n", expand=True) idea

Would it work to first create a boolean index for which rows are impacted by this? Perhaps checking for columns 1 and 2 to be empty & column 0 having two \ns?

Something like this:

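col_names = list(df)
# Affected rows: columns 1 and 2 are empty (use .eq("") instead of .isna() if blanks are empty strings) and column 0 has exactly two "\n"
bad_rows = df.loc[:, col_names[1:3]].isna().all(axis=1) & (df.loc[:, col_names[0]].str.count("\n") == 2)
# Split column 0 into three pieces and write them back across columns 0-2 by position
df.loc[bad_rows, col_names[0:3]] = df.loc[bad_rows, col_names[0]].str.split("\n", expand=True).to_numpy()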
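
For anyone who wants to try the idea end to end, here is a minimal, self-contained sketch on toy data. The helper name `split_merged_rows` and the toy frame are made up for illustration, and it assumes that "empty" cells may be either NaN or blank strings, which may not match what the parser actually emits.

```python
import pandas as pd


def split_merged_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with merged first-column rows spread across the first three columns."""
    out = df.copy()
    first, second, third = list(out.columns[:3])

    # Assumption: an "empty" cell is either NaN or a blank string.
    others = out[[second, third]].fillna("").astype(str)
    others_empty = others.apply(lambda col: col.str.strip() == "").all(axis=1)

    # Heuristic from the reply above: exactly two newlines in the first column.
    two_breaks = out[first].fillna("").astype(str).str.count("\n") == 2

    bad_rows = others_empty & two_breaks
    if bad_rows.any():
        parts = out.loc[bad_rows, first].str.split("\n", expand=True)
        # Assign by position so the split's 0/1/2 labels need not match df's labels.
        out.loc[bad_rows, [first, second, third]] = parts.to_numpy()
    return out


# Hypothetical toy data shaped like the problem frame.
toy = pd.DataFrame({
    0: ["Due West Family Medicine\nDue West\nAbbeville", "Doctors Care - Northeast"],
    1: ["", "Columbia"],
    2: ["", "Richland"],
})
fixed = split_merged_rows(toy)
```

Rows that already parsed correctly pass through unchanged, and the input frame is not modified.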

1 reply

tlyon3 Mar 26, 2021
Author

This almost works. Problem is some of the entries don't follow the heuristic of exactly two '\n'.

Row number 2 for example is this: 'Abbeville Area Medical Center Group\nAbbeville Area Medical Center\nAbbeville \nAbbeville'

It looks like this is more of a pdf parsing problem now because here is the original table:

Not sure what to do about this scraper now...
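
One possible way to relax the "exactly two \n" heuristic is to split from the right, so the last two pieces become city and county and anything before them stays with the provider. This is only a sketch: it assumes extra line breaks always belong to the provider name, which may not hold for every row, and the column names are made up for illustration.

```python
import pandas as pd

merged = pd.Series([
    "Abbeville Area Medical Center Group\nAbbeville Area Medical Center\nAbbeville \nAbbeville",
    "Due West Family Medicine\nDue West\nAbbeville",
])

# Split from the right at most twice: the last two pieces are taken as city and county.
parts = merged.str.rsplit("\n", n=2, expand=True)
parts.columns = ["provider", "city", "county"]  # hypothetical column names

# Assumption: a leftover newline inside the provider cell is a wrapped name.
parts["provider"] = parts["provider"].str.replace("\n", " ", regex=False)

# Trim stray spaces such as the trailing one in "Abbeville ".
parts = parts.apply(lambda col: col.str.strip())
```

If the extra line is really a second provider fused into the same cell, this will hide the problem rather than fix it, so fixing the PDF parsing itself is still the safer route.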

ghop02 · 2021-03-26T18:52:02Z


How difficult/stable is it to modify the camelot settings to better identify the header columns? Or to only pull a subset of those header columns?

2 replies

tlyon3 Mar 26, 2021
Author

I'm not sure how difficult it is to change the settings. I'm also not sure what caveats will come into play by changing them, because it parses correctly about 90% of the time. There are just some files that don't get parsed correctly. In fact, it's not even the whole PDF; it happens on a single page in random files. I haven't been able to identify when it happens.

tlyon3 Apr 1, 2021
Author

Using the documentation I was able to tweak the settings to get it to parse correctly.

The specific settings I needed to set were line_scale=40 and split_text=True
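
For reference, a rough sketch of how those two settings might be passed to camelot. The helper name `read_sc_tables`, the `pages="all"` default, and `flavor="lattice"` are assumptions for illustration; `line_scale` only applies to the lattice parser, so that flavor is assumed here rather than confirmed by the scraper.

```python
# Settings reported above, plus an assumed flavor (line_scale is a lattice-only option).
CAMELOT_KWARGS = {
    "flavor": "lattice",   # assumption, not stated in the thread
    "line_scale": 40,      # larger value detects shorter ruling lines
    "split_text": True,    # split text that spans multiple cells
}


def read_sc_tables(pdf_path, pages="all"):
    """Hypothetical helper: read tables from one PDF and return them as DataFrames."""
    import camelot  # imported lazily so the settings can be inspected without camelot installed

    tables = camelot.read_pdf(pdf_path, pages=pages, **CAMELOT_KWARGS)
    return [table.df for table in tables]
```

Since the bad parse only showed up on some pages, it may be worth comparing the output with and without these settings on a file that already parsed correctly.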
