Trying to load multiple json files and merge into one pandas dataframe
I am trying to load multiple JSON files from a directory in my Google Drive into one pandas dataframe.
I have tried quite a few solutions, but nothing has worked so far.
This is what I have tried:
import os
import json
import pandas as pd

path_to_json = '/path/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
jsons_data = pd.DataFrame(columns=['participants','messages','active','threadtype','thread path'])

for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)
        participants = json_text['participants']
        messages = json_text['messages']
        active = json_text['is_still_participant']
        threadtype = json_text['thread_type']
        threadpath = json_text['thread_path']
        jsons_data.loc[index] = [participants, messages, active, threadtype, threadpath]

jsons_data
And this is the full traceback of error message I am receiving:
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-30-8385abf6a3a7> in <module>()
1 for index, js in enumerate(json_files):
2 with open(os.path.join(path_to_json, js)) as json_file:
----> 3 json_text = json.load(json_file)
4 participants = json_text['participants']
5 messages = json_text['messages']
/usr/lib/python3.6/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
297 cls=cls, object_hook=object_hook,
298 parse_float=parse_float, parse_int=parse_int,
--> 299 parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
300
301
/usr/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
352 parse_int is None and parse_float is None and
353 parse_constant is None and object_pairs_hook is None and not kw):
--> 354 return _default_decoder.decode(s)
355 if cls is None:
356 cls = JSONDecoder
/usr/lib/python3.6/json/decoder.py in decode(self, s, _w)
337
338 """
--> 339 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
340 end = _w(s, end).end()
341 if end != len(s):
/usr/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
355 obj, end = self.scan_once(s, idx)
356 except StopIteration as err:
--> 357 raise JSONDecodeError("Expecting value", s, err.value) from None
358 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I have added a sample of the JSON files I am trying to read from.
Link to Jsons
Examples of the JSONs:
{
  participants: [
    {
      name: "Test 1"
    },
    {
      name: "Person"
    }
  ],
  messages: [
    {
      sender_name: "Person",
      timestamp_ms: 1485467319139,
      content: "Hie",
      type: "Generic"
    }
  ],
  title: "Test 1",
  is_still_participant: true,
  thread_type: "Regular",
  thread_path: "inbox/xyz"
}

#second example
{
  participants: [
    {
      name: "Clearance"
    },
    {
      name: "Person"
    }
  ],
  messages: [
    {
      sender_name: "Emmanuel Sibanda",
      timestamp_ms: 1212242073308,
      content: "Dear",
      share: {
        link: "http://www.example.com/"
      },
      type: "Share"
    }
  ],
  title: "Clearance",
  is_still_participant: true,
  thread_type: "Regular",
  thread_path: "inbox/Clearance"
}
python json pandas
Instead of adding a link to the JSON files, can you edit your question and show example JSON from the files.
– amanb
Jan 14 at 7:44
@amanb added 2 examples of the JSON files
– Emm
Jan 14 at 7:52
These are malformed JSONs. The keys are not strings, and some values are also not strings, which may eventually cause errors. You should find a way to validate them using a JSON validator (a minimal sketch follows these comments).
– amanb
Jan 14 at 8:01
Can you share an example output of the dataframe you expect to see? What shape do you expect? Some sort of a tabular representation of one of the sample JSONS should be ideal!
– amanb
Jan 14 at 20:01
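A minimal sketch of such a validation pass, reporting each file that fails to parse and where (path_to_json is assumed from the question):

import json
import os

path_to_json = '/path/'  #assumed from the question; point this at your folder

#try to parse each .json file and report the position of the first error
for filename in sorted(os.listdir(path_to_json)):
    if not filename.endswith('.json'):
        continue
    with open(os.path.join(path_to_json, filename)) as f:
        try:
            json.load(f)
            print(filename, 'OK')
        except json.JSONDecodeError as e:
            print(filename, 'invalid JSON:', e)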
2 Answers
I've checked your JSON files and found the same problem in document1.json, document2.json and document3.json: the property names are not enclosed in double quotes.
For example, document1.json should be corrected as:
{
  "participants": [
    {
      "name": "Clothing"
    },
    {
      "name": "Person"
    }
  ],
  "messages": [
    {
      "sender_name": "Person",
      "timestamp_ms": 1210107456233,
      "content": "Good day",
      "type": "Generic"
    }
  ],
  "title": "Clothing",
  "is_still_participant": true,
  "thread_type": "Regular",
  "thread_path": "inbox/Clothing"
}
EDIT: you can use the following line to add double quotes to the keys of a JSON file:
re.sub(r'([^\s^"]+):(.+)', r'"\1":\2', s)
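For illustration, a minimal sketch that applies this substitution to one of the sample files and parses the result (document1.json as above):

import json
import re

#read the malformed file as plain text
with open('document1.json') as f:
    s = f.read()

#quote every unquoted key, e.g.  participants: [  ->  "participants": [
fixed = re.sub(r'([^\s^"]+):(.+)', r'"\1":\2', s)
data = json.loads(fixed)  #parses cleanly once the keys are quoted
print(data['title'])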
I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?
– Emm
Jan 14 at 7:48
The regex string is helpful!
– amanb
Jan 14 at 9:57
Hi @Emm, if this or any answer has solved your question, please consider accepting it.
– keineahnung2345
Jan 15 at 9:18
There were a few challenges in working with the JSON files you provided, and a few more in converting them to dataframes and merging them. First, the keys of the JSONs were not strings; second, the arrays in the resulting "valid" JSONs were of different lengths and could not be converted to dataframes directly; and third, you did not specify the shape you expect for the dataframe.
Nevertheless, this is an important problem, as malformed JSONs are more common than "valid" ones, and despite several SO answers on fixing such JSON strings, every malformed JSON problem is unique.
I've broken the problem down into the following parts:
- convert the malformed JSONs in the files to valid JSONs
- flatten the dicts in the valid JSON files to prepare for dataframe conversion
- create dataframes from the files and merge them into one dataframe
Note: For this answer, I copied the example JSON strings you provided into two files, "test.json" and "test1.json", and saved them in a "Test" folder.
Part 1: convert malformed JSONs in the files to valid JSONs:
The two example JSON strings you provided were not valid JSON at all, because the keys were not quoted strings. So even if you load the JSON file and parse the contents, an error appears.
with open('./Test/test.json') as f:
    data = json.load(f)
print(data)

#Error:
JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
The only way I found to get around this problem was:
- to convert all the JSON files to txt files, as this treats the contents as plain strings
- to run a regex on the JSON strings in the text files and add quotes (" ") around the keys
- to save the files as JSON again
The above three steps were accomplished with two functions that I wrote. The first one renames the files to txt files and returns a list of filenames. The second one accepts this list of filenames, fixes the JSON keys using a regex and saves them to JSON format again.
import json
import os
import re
import pandas as pd

#rename to txt files and return list of filenames
def rename_to_text_files():
    all_new_filenames = []
    for filename in os.listdir('./Test'):
        if filename.endswith("json"):
            new_filename = filename.split('.')[0] + '.txt'
            os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
            all_new_filenames.append(new_filename)
        else:
            all_new_filenames.append(filename)
    return all_new_filenames
#fix JSON string and save as a JSON file again, returns a list of valid JSON filenames
def fix_dict_rename_to_json_files(files):
    json_validated_files = []
    for index, filename in enumerate(files):
        filepath = os.path.join('./Test', filename)
        with open(filepath, 'r+') as f:
            data = f.read()
            dict_converted = re.sub(r'(\w+):(.+)', r'"\1":\2', data)
            f.seek(0)
            f.write(dict_converted)
            f.truncate()
        #rename back to .json
        new_filename = filename[:-4] + '.json'
        os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
        json_validated_files.append(new_filename)
    print("All files converted to valid JSON!")
    return json_validated_files
So, now I had two JSON files with valid JSON. But they were still not ready for dataframe conversion. To explain things better, consider the valid JSON from "test.json":
#test.json
{
  "participants": [
    {
      "name": "Test 1"
    },
    {
      "name": "Person"
    }
  ],
  "messages": [
    {
      "sender_name": "Person",
      "timestamp_ms": 1485467319139,
      "content": "Hie",
      "type": "Generic"
    }
  ],
  "title": "Test 1",
  "is_still_participant": true,
  "thread_type": "Regular",
  "thread_path": "inbox/xyz"
}
If I read the JSON into a dataframe, I still get an error, because the array lengths differ between keys. You can check this: the "messages" value is an array of length 1, while "participants" is an array of length 2:
df = pd.read_json('./Test/test.json')
print(df)

#Error
ValueError: arrays must all be same length
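As an aside, newer pandas versions offer pandas.json_normalize, which can flatten such nested records directly; a minimal sketch, assuming the key layout from the examples (the participants list would still need separate handling):

import json
import pandas as pd

with open('./Test/test.json') as f:
    data = json.load(f)

#one row per message; thread-level fields are repeated as metadata columns
df = pd.json_normalize(data, record_path='messages',
                       meta=['title', 'is_still_participant', 'thread_type', 'thread_path'])
print(df)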
In the next part, we fix this problem by flattening the dict in the JSON.
Part 2: Flatten dict for dataframe conversion:
As you had not specified the shape you expect for your dataframe, I extracted the values as best I could and flattened the dict with the following function. This assumes that the keys in the example JSONs do not change across all the JSON files:
#accepts a dictionary, flattens it as required and returns the dictionary with updated key/value pairs
def flatten(d):
    values = []
    d['participants_name'] = d.pop('participants')
    for i in d['participants_name']:
        values.append(i['name'])
    for i in d['messages']:
        d['messages_sender_name'] = i['sender_name']
        d['messages_timestamp_ms'] = str(i['timestamp_ms'])
        d['messages_content'] = i['content']
        d['messages_type'] = i['type']
        if "share" in i:
            d['messages_share_link'] = i["share"]["link"]
    d["is_still_participant"] = str(d["is_still_participant"])
    d.pop('messages')
    d.update(participants_name=values)
    return d
This time let's consider the second example JSON string which also has a "share" key with a URL. The valid JSON string is as below:
#test1.json
{
  "participants": [
    {
      "name": "Clearance"
    },
    {
      "name": "Person"
    }
  ],
  "messages": [
    {
      "sender_name": "Emmanuel Sibanda",
      "timestamp_ms": 1212242073308,
      "content": "Dear",
      "share": {
        "link": "http://www.example.com/"
      },
      "type": "Share"
    }
  ],
  "title": "Clearance",
  "is_still_participant": true,
  "thread_type": "Regular",
  "thread_path": "inbox/Clearance"
}
When we flatten this dict with the above function, we get a dict that can easily be fed into the DataFrame constructor (discussed later):
with open('./Test/test1.json') as f:
    data = json.load(f)
print(flatten(data))

#Output:
{'title': 'Clearance',
 'is_still_participant': 'True',
 'thread_type': 'Regular',
 'thread_path': 'inbox/Clearance',
 'participants_name': ['Clearance', 'Person'],
 'messages_sender_name': 'Emmanuel Sibanda',
 'messages_timestamp_ms': '1212242073308',
 'messages_content': 'Dear',
 'messages_type': 'Share',
 'messages_share_link': 'http://www.example.com/'}
Part 3: Create dataframes and merge them into one:
So now that we have a function that can flatten the dict, we can call it inside our final function, where we will:
- open the JSON files one by one and load each JSON as a dict in memory using json.load()
- call the flatten function on each dict
- convert the flattened dicts to dataframes
- append all dataframes to an empty list
- merge all dataframes with pd.concat(), passing the list of dataframes as an argument
The code to accomplish these tasks:
#accepts a list of valid json filenames, creates dataframes from flattened dicts in the JSON files,
#merges the dataframes and returns the merged dataframe
def create_merge_dataframes(list_of_valid_json_files):
    df_list = []
    for index, js in enumerate(list_of_valid_json_files):
        with open(os.path.join('./Test', js)) as json_file:
            data = json.load(json_file)
            flattened_json_data = flatten(data)
            df = pd.DataFrame(flattened_json_data)
            df_list.append(df)
    merged_df = pd.concat(df_list, sort=False, ignore_index=True)
    return merged_df
Let's give the whole code a test run. We begin with the functions in Part 1 and end with Part 3 to get a merged dataframe.
#rename invalid JSON files to text
files = rename_to_text_files()
#fix JSON strings and save as JSON files again. We pass the "files" variable above as an arg for this function
json_validated_files = fix_dict_rename_to_json_files(files)
#flatten and receive merged dataframes
df = create_merge_dataframes(json_validated_files)
print(df)
The final Dataframe:
title is_still_participant thread_type thread_path
0 Test 1 True Regular inbox/xyz
1 Test 1 True Regular inbox/xyz
2 Clearance True Regular inbox/Clearance
3 Clearance True Regular inbox/Clearance
participants_name messages_sender_name messages_timestamp_ms
0 Test 1 Person 1485467319139
1 Person Person 1485467319139
2 Clearance Emmanuel Sibanda 1212242073308
3 Person Emmanuel Sibanda 1212242073308
messages_content messages_type messages_share_link
0 Hie Generic NaN
1 Hie Generic NaN
2 Dear Share http://www.example.com/
3 Dear Share http://www.example.com/
You can change the order of columns as you like.
Note:
- The code does not have exception handling and assumes the keys will be the same across dicts, as shown in your examples.
- The shape and columns of the dataframes have also been assumed.
- You may add all the functions to one Python script; wherever "./Test" is used for the JSON folder path, enter your own path. The folder should contain only the malformed JSON files to begin with.
- The whole script can be further modularized by putting the functions into a class.
- It can also be further optimized with hashable data types like tuples, and sped up with the threading and asyncio libraries (a minimal sketch follows below). However, for a folder of 1000 files this code should work fairly well and shouldn't take very long.
- Some errors may still crop up while converting the malformed JSON files to valid ones, as the contents of all the JSON files are not known.
The code discussed provides a workflow to accomplish what you need and I hope this helps you and anyone who comes across a similar problem.
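On the threading note above, a minimal sketch of parallelizing the per-file fix with concurrent.futures; fix_one_file is a hypothetical stand-in for the per-file work in fix_dict_rename_to_json_files, and threads are appropriate here because the work is I/O-bound:

import os
import re
from concurrent.futures import ThreadPoolExecutor

def fix_one_file(filepath):
    #quote the keys of a single file in place, as in Part 1
    with open(filepath, 'r+') as f:
        data = f.read()
        f.seek(0)
        f.write(re.sub(r'(\w+):(.+)', r'"\1":\2', data))
        f.truncate()
    return filepath

paths = [os.path.join('./Test', name)
         for name in os.listdir('./Test') if name.endswith('.json')]

with ThreadPoolExecutor(max_workers=8) as pool:
    for done in pool.map(fix_one_file, paths):
        print("fixed:", done)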
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54177131%2ftrying-to-load-multiple-json-files-and-merge-into-one-pandas-dataframe%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
I've checked your json files, and found that there are same problems in document1.json
, document2.json
and document3.json
: the property name are not enclosed with double quotes.
For example, document1.json
should be corrected as:
{
"participants": [
{
"name": "Clothing"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1210107456233,
"content": "Good day",
"type": "Generic"
}
],
"title": "Clothing",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clothing"
}
EDIT: you can use following line to add double quote to the keys of a json file:
re.sub("([^s^"]+):(.+)", '"\1":\2', s)
I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?
– Emm
Jan 14 at 7:48
The regex string is helpful!
– amanb
Jan 14 at 9:57
Hi @Emm if this or any answer has solved your question please consider accepting it .
– keineahnung2345
Jan 15 at 9:18
add a comment |
I've checked your json files, and found that there are same problems in document1.json
, document2.json
and document3.json
: the property name are not enclosed with double quotes.
For example, document1.json
should be corrected as:
{
"participants": [
{
"name": "Clothing"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1210107456233,
"content": "Good day",
"type": "Generic"
}
],
"title": "Clothing",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clothing"
}
EDIT: you can use following line to add double quote to the keys of a json file:
re.sub("([^s^"]+):(.+)", '"\1":\2', s)
I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?
– Emm
Jan 14 at 7:48
The regex string is helpful!
– amanb
Jan 14 at 9:57
Hi @Emm if this or any answer has solved your question please consider accepting it .
– keineahnung2345
Jan 15 at 9:18
add a comment |
I've checked your json files, and found that there are same problems in document1.json
, document2.json
and document3.json
: the property name are not enclosed with double quotes.
For example, document1.json
should be corrected as:
{
"participants": [
{
"name": "Clothing"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1210107456233,
"content": "Good day",
"type": "Generic"
}
],
"title": "Clothing",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clothing"
}
EDIT: you can use following line to add double quote to the keys of a json file:
re.sub("([^s^"]+):(.+)", '"\1":\2', s)
I've checked your json files, and found that there are same problems in document1.json
, document2.json
and document3.json
: the property name are not enclosed with double quotes.
For example, document1.json
should be corrected as:
{
"participants": [
{
"name": "Clothing"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1210107456233,
"content": "Good day",
"type": "Generic"
}
],
"title": "Clothing",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clothing"
}
EDIT: you can use following line to add double quote to the keys of a json file:
re.sub("([^s^"]+):(.+)", '"\1":\2', s)
edited Jan 14 at 8:59
answered Jan 14 at 7:44
keineahnung2345keineahnung2345
60012
60012
I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?
– Emm
Jan 14 at 7:48
The regex string is helpful!
– amanb
Jan 14 at 9:57
Hi @Emm if this or any answer has solved your question please consider accepting it .
– keineahnung2345
Jan 15 at 9:18
add a comment |
I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?
– Emm
Jan 14 at 7:48
The regex string is helpful!
– amanb
Jan 14 at 9:57
Hi @Emm if this or any answer has solved your question please consider accepting it .
– keineahnung2345
Jan 15 at 9:18
I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?
– Emm
Jan 14 at 7:48
I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?
– Emm
Jan 14 at 7:48
The regex string is helpful!
– amanb
Jan 14 at 9:57
The regex string is helpful!
– amanb
Jan 14 at 9:57
Hi @Emm if this or any answer has solved your question please consider accepting it .
– keineahnung2345
Jan 15 at 9:18
Hi @Emm if this or any answer has solved your question please consider accepting it .
– keineahnung2345
Jan 15 at 9:18
add a comment |
There were a few challenges working with the JSON files you provided and then some more converting them to dataframes and merging. This was because the keys of the JSON's were not strings, secondly, the arrays of the resulting "valid" JSONS were of different length and could not be converted to dataframes directly and thirdly, you did not specify the shape of the dataframe.
Nevertheless, this is an important problem as malformed JSON's are more commonplace than "valid" ones and despite several SO answers to fix such JSON strings, every malformed JSON problem is unique on its own.
I've broken down the problem into the following parts:
- convert malformed JSON's in the files to valid JSON's
- flatten the dict's in the valid JSON files to prepare for dataframe conversion
- create dataframes from the files and merge into one dataframe
Note: For this answer, I copied the example JSON strings you provided into two files namely "test.json" and "test1.json" and saved them into a "Test" folder.
Part 1: convert malformed JSON's in the files to valid JSON's:
The two example JSON strings that you provided had no data type whatsoever. This is because the keys were not strings and were invalid. So, even if you load the JSON file and parse the contents, an error appears.
with open('./Test/test.json') as f:
data = json.load(f)
print(data)
#Error:
JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
The only way I found to get around this problem was:
- to convert all the JSON files to txt files as this would convert the contents to string
- perform regex on the JSON string in text files and add quotes(" ") around the keys
- save the file as JSON again
The above three steps were accomplished with two functions that I wrote. The first one renames the files to txt files and returns a list of filenames. The second one accepts this list of filenames, fixes the JSON keys using a regex and saves them to JSON format again.
import json
import os
import re
import pandas as pd
#rename to txt files and return list of filenames
def rename_to_text_files():
all_new_filenames =
for filename in os.listdir('./Test'):
if filename.endswith("json"):
new_filename = filename.split('.')[0] + '.txt'
os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
all_new_filenames.append(new_filename)
else:
all_new_filenames.append(filename)
return all_new_filenames
#fix JSON string and save as a JSON file again, returns a list of valid JSON filenames
def fix_dict_rename_to_json_files(files):
json_validated_files =
for index, filename in enumerate(files):
filepath = os.path.join('./Test',filename)
with open(filepath,'r+') as f:
data = f.read()
dict_converted = re.sub("(w+):(.+)", r'"1":2', data)
f.seek(0)
f.write(dict_converted)
f.truncate()
#rename
new_filename = filename[:-4] + '.json'
os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
json_validated_files.append(new_filename)
print("All files converted to valid JSON!")
return json_validated_files
So, now I had two JSON files with valid JSON. But they were still not ready for dataframe conversion. To explain things better, consider the valid JSON from "test.json":
#test.json
{
"participants": [
{
"name": "Test 1"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1485467319139,
"content": "Hie",
"type": "Generic"
}
],
"title": "Test 1",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/xyz"
}
If I read the json into a dataframe, I still get an error because the array lengths are different for each of the keys. You can check this: the "messages" key value is an array of length 1 while "participants" has a value of array length 2:
df = pd.read_json('./Test/test.json')
print(df)
#Error
ValueError: arrays must all be same length
In the next part, we fix this problem by flattening the dict in the JSON.
Part 2: Flatten dict for dataframe conversion:
As you had not specified the shape that you expect for your dataframe, I extracted the values in the best way possible and flattened the dict with the following function. This is assuming the keys provided in the example JSON's will not change across all JSON files:
#accepts a dictionary, flattens as required and returns the dictionary with updated key/value pairs
def flatten(d):
values =
d['participants_name'] = d.pop('participants')
for i in d['participants_name']:
values.append(i['name'])
for i in d['messages']:
d['messages_sender_name'] = i['sender_name']
d['messages_timestamp_ms'] = str(i['timestamp_ms'])
d['messages_content'] = i['content']
d['messages_type'] = i['type']
if "share" in i:
d['messages_share_link'] = i["share"]["link"]
d["is_still_participant"] = str(d["is_still_participant"])
d.pop('messages')
d.update(participants_name=values)
return d
This time let's consider the second example JSON string which also has a "share" key with a URL. The valid JSON string is as below:
#test1.json
{
"participants": [
{
"name": "Clearance"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Emmanuel Sibanda",
"timestamp_ms": 1212242073308,
"content": "Dear",
"share": {
"link": "http://www.example.com/"
},
"type": "Share"
}
],
"title": "Clearance",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clearance"
}
When we flatten this dict with the above function, we get a dict that can be easily fed into a DataFrame
function(discussed later):
with open('./Test/test1.json') as f:
data = json.load(f)
print(flatten(data))
#Output:
{'title': 'Clearance',
'is_still_participant': 'True',
'thread_type': 'Regular',
'thread_path': 'inbox/Clearance',
'participants_name': ['Clearance', 'Person'],
'messages_sender_name': 'Emmanuel Sibanda',
'messages_timestamp_ms': '1212242073308',
'messages_content': 'Dear',
'messages_type': 'Share',
'messages_share_link': 'http://www.example.com/'}
Part 3: Create Dataframes and merge them into one:
So now that we have a function that can flatten the dict, we can call this function inside our final function where we will:
- open the JSON files one by one, load each JSON as a dict in memory using
json.load()
. - call the flatten function on each dict
- convert the flattened dicts to dataframes
- append all dataframes to an empty list.
- merge all dataframes with
pd.concat()
passing the list of dataframes as an argument.
The code to accomplish these tasks:
#accepts a list of valid json filenames, creates dataframes from flattened dicts in the JSON files, merges the dataframes and returns the merged dataframe.
def create_merge_dataframes(list_of_valid_json_files):
df_list =
for index, js in enumerate(list_of_valid_json_files):
with open(os.path.join('./Test', js)) as json_file:
data = json.load(json_file)
flattened_json_data = flatten(data)
df = pd.DataFrame(flattened_json_data)
df_list.append(df)
merged_df = pd.concat(df_list,sort=False, ignore_index=True)
return merged_df
Let's give the whole code a test run. We begin with functions in Part1 and end with Part 3, to get a merged ddataframe.
#rename invalid JSON files to text
files = rename_to_text_files()
#fix JSON strings and save as JSON files again. We pass the "files" variable above as an arg for this function
json_validated_files = fix_dict_rename_to_json_files(files)
#flatten and receive merged dataframes
df = create_merge_dataframes(json_validated_files)
print(df)
The final Dataframe:
title is_still_participant thread_type thread_path
0 Test 1 True Regular inbox/xyz
1 Test 1 True Regular inbox/xyz
2 Clearance True Regular inbox/Clearance
3 Clearance True Regular inbox/Clearance
participants_name messages_sender_name messages_timestamp_ms
0 Test 1 Person 1485467319139
1 Person Person 1485467319139
2 Clearance Emmanuel Sibanda 1212242073308
3 Person Emmanuel Sibanda 1212242073308
messages_content messages_type messages_share_link
0 Hie Generic NaN
1 Hie Generic NaN
2 Dear Share http://www.example.com/
3 Dear Share http://www.example.com/
You can change the order of columns as you like.
Note:
- The code does not have Exception handling and assumes the keys will be the same for the dicts as shown in your examples
- The shape and columns of the Dataframes has also been assumed
- You may add all the functions into one Python script and wherever "./Test" is used for the JSON folder path, you should enter your path. The folder should only contain mailformed JSON files to begin with.
- The whole script can be further modularized by putting the functions into a class.
- It can also be further optimized with the use of hashable data types like tuples and sped up with
threading
andasyncio
libraries. However, for a folder of 1000 files this code should work fairly well and shouldn't take very long. - It could be possible some errors may crop up while converting the malformed JSON files to valid ones, as the contents of all the JSON files is not known.
The code discussed provides a workflow to accomplish what you need and I hope this helps you and anyone who comes across a similar problem.
add a comment |
There were a few challenges working with the JSON files you provided and then some more converting them to dataframes and merging. This was because the keys of the JSON's were not strings, secondly, the arrays of the resulting "valid" JSONS were of different length and could not be converted to dataframes directly and thirdly, you did not specify the shape of the dataframe.
Nevertheless, this is an important problem as malformed JSON's are more commonplace than "valid" ones and despite several SO answers to fix such JSON strings, every malformed JSON problem is unique on its own.
I've broken down the problem into the following parts:
- convert malformed JSON's in the files to valid JSON's
- flatten the dict's in the valid JSON files to prepare for dataframe conversion
- create dataframes from the files and merge into one dataframe
Note: For this answer, I copied the example JSON strings you provided into two files namely "test.json" and "test1.json" and saved them into a "Test" folder.
Part 1: convert malformed JSON's in the files to valid JSON's:
The two example JSON strings that you provided had no data type whatsoever. This is because the keys were not strings and were invalid. So, even if you load the JSON file and parse the contents, an error appears.
with open('./Test/test.json') as f:
data = json.load(f)
print(data)
#Error:
JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
The only way I found to get around this problem was:
- to convert all the JSON files to txt files as this would convert the contents to string
- perform regex on the JSON string in text files and add quotes(" ") around the keys
- save the file as JSON again
The above three steps were accomplished with two functions that I wrote. The first one renames the files to txt files and returns a list of filenames. The second one accepts this list of filenames, fixes the JSON keys using a regex and saves them to JSON format again.
import json
import os
import re
import pandas as pd
#rename to txt files and return list of filenames
def rename_to_text_files():
all_new_filenames =
for filename in os.listdir('./Test'):
if filename.endswith("json"):
new_filename = filename.split('.')[0] + '.txt'
os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
all_new_filenames.append(new_filename)
else:
all_new_filenames.append(filename)
return all_new_filenames
#fix JSON string and save as a JSON file again, returns a list of valid JSON filenames
def fix_dict_rename_to_json_files(files):
json_validated_files =
for index, filename in enumerate(files):
filepath = os.path.join('./Test',filename)
with open(filepath,'r+') as f:
data = f.read()
dict_converted = re.sub("(w+):(.+)", r'"1":2', data)
f.seek(0)
f.write(dict_converted)
f.truncate()
#rename
new_filename = filename[:-4] + '.json'
os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
json_validated_files.append(new_filename)
print("All files converted to valid JSON!")
return json_validated_files
So, now I had two JSON files with valid JSON. But they were still not ready for dataframe conversion. To explain things better, consider the valid JSON from "test.json":
#test.json
{
"participants": [
{
"name": "Test 1"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1485467319139,
"content": "Hie",
"type": "Generic"
}
],
"title": "Test 1",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/xyz"
}
If I read the json into a dataframe, I still get an error because the array lengths are different for each of the keys. You can check this: the "messages" key value is an array of length 1 while "participants" has a value of array length 2:
df = pd.read_json('./Test/test.json')
print(df)
#Error
ValueError: arrays must all be same length
In the next part, we fix this problem by flattening the dict in the JSON.
Part 2: Flatten dict for dataframe conversion:
As you had not specified the shape that you expect for your dataframe, I extracted the values in the best way possible and flattened the dict with the following function. This is assuming the keys provided in the example JSON's will not change across all JSON files:
#accepts a dictionary, flattens as required and returns the dictionary with updated key/value pairs
def flatten(d):
values =
d['participants_name'] = d.pop('participants')
for i in d['participants_name']:
values.append(i['name'])
for i in d['messages']:
d['messages_sender_name'] = i['sender_name']
d['messages_timestamp_ms'] = str(i['timestamp_ms'])
d['messages_content'] = i['content']
d['messages_type'] = i['type']
if "share" in i:
d['messages_share_link'] = i["share"]["link"]
d["is_still_participant"] = str(d["is_still_participant"])
d.pop('messages')
d.update(participants_name=values)
return d
This time let's consider the second example JSON string which also has a "share" key with a URL. The valid JSON string is as below:
#test1.json
{
"participants": [
{
"name": "Clearance"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Emmanuel Sibanda",
"timestamp_ms": 1212242073308,
"content": "Dear",
"share": {
"link": "http://www.example.com/"
},
"type": "Share"
}
],
"title": "Clearance",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clearance"
}
When we flatten this dict with the above function, we get a dict that can be easily fed into a DataFrame
function(discussed later):
with open('./Test/test1.json') as f:
data = json.load(f)
print(flatten(data))
#Output:
{'title': 'Clearance',
'is_still_participant': 'True',
'thread_type': 'Regular',
'thread_path': 'inbox/Clearance',
'participants_name': ['Clearance', 'Person'],
'messages_sender_name': 'Emmanuel Sibanda',
'messages_timestamp_ms': '1212242073308',
'messages_content': 'Dear',
'messages_type': 'Share',
'messages_share_link': 'http://www.example.com/'}
Part 3: Create Dataframes and merge them into one:
So now that we have a function that can flatten the dict, we can call this function inside our final function where we will:
- open the JSON files one by one, load each JSON as a dict in memory using
json.load()
. - call the flatten function on each dict
- convert the flattened dicts to dataframes
- append all dataframes to an empty list.
- merge all dataframes with
pd.concat()
passing the list of dataframes as an argument.
The code to accomplish these tasks:
#accepts a list of valid json filenames, creates dataframes from flattened dicts in the JSON files, merges the dataframes and returns the merged dataframe.
def create_merge_dataframes(list_of_valid_json_files):
df_list =
for index, js in enumerate(list_of_valid_json_files):
with open(os.path.join('./Test', js)) as json_file:
data = json.load(json_file)
flattened_json_data = flatten(data)
df = pd.DataFrame(flattened_json_data)
df_list.append(df)
merged_df = pd.concat(df_list,sort=False, ignore_index=True)
return merged_df
Let's give the whole code a test run. We begin with functions in Part1 and end with Part 3, to get a merged ddataframe.
#rename invalid JSON files to text
files = rename_to_text_files()
#fix JSON strings and save as JSON files again. We pass the "files" variable above as an arg for this function
json_validated_files = fix_dict_rename_to_json_files(files)
#flatten and receive merged dataframes
df = create_merge_dataframes(json_validated_files)
print(df)
The final Dataframe:
title is_still_participant thread_type thread_path
0 Test 1 True Regular inbox/xyz
1 Test 1 True Regular inbox/xyz
2 Clearance True Regular inbox/Clearance
3 Clearance True Regular inbox/Clearance
participants_name messages_sender_name messages_timestamp_ms
0 Test 1 Person 1485467319139
1 Person Person 1485467319139
2 Clearance Emmanuel Sibanda 1212242073308
3 Person Emmanuel Sibanda 1212242073308
messages_content messages_type messages_share_link
0 Hie Generic NaN
1 Hie Generic NaN
2 Dear Share http://www.example.com/
3 Dear Share http://www.example.com/
You can change the order of columns as you like.
Note:
- The code does not have Exception handling and assumes the keys will be the same for the dicts as shown in your examples
- The shape and columns of the Dataframes has also been assumed
- You may add all the functions into one Python script and wherever "./Test" is used for the JSON folder path, you should enter your path. The folder should only contain mailformed JSON files to begin with.
- The whole script can be further modularized by putting the functions into a class.
- It can also be further optimized with the use of hashable data types like tuples and sped up with
threading
andasyncio
libraries. However, for a folder of 1000 files this code should work fairly well and shouldn't take very long. - It could be possible some errors may crop up while converting the malformed JSON files to valid ones, as the contents of all the JSON files is not known.
The code discussed provides a workflow to accomplish what you need and I hope this helps you and anyone who comes across a similar problem.
add a comment |
There were a few challenges working with the JSON files you provided and then some more converting them to dataframes and merging. This was because the keys of the JSON's were not strings, secondly, the arrays of the resulting "valid" JSONS were of different length and could not be converted to dataframes directly and thirdly, you did not specify the shape of the dataframe.
Nevertheless, this is an important problem as malformed JSON's are more commonplace than "valid" ones and despite several SO answers to fix such JSON strings, every malformed JSON problem is unique on its own.
I've broken down the problem into the following parts:
- convert malformed JSON's in the files to valid JSON's
- flatten the dict's in the valid JSON files to prepare for dataframe conversion
- create dataframes from the files and merge into one dataframe
Note: For this answer, I copied the example JSON strings you provided into two files namely "test.json" and "test1.json" and saved them into a "Test" folder.
Part 1: convert malformed JSON's in the files to valid JSON's:
The two example JSON strings that you provided had no data type whatsoever. This is because the keys were not strings and were invalid. So, even if you load the JSON file and parse the contents, an error appears.
with open('./Test/test.json') as f:
data = json.load(f)
print(data)
#Error:
JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
The only way I found to get around this problem was:
- to convert all the JSON files to txt files as this would convert the contents to string
- perform regex on the JSON string in text files and add quotes(" ") around the keys
- save the file as JSON again
The above three steps were accomplished with two functions that I wrote. The first one renames the files to txt files and returns a list of filenames. The second one accepts this list of filenames, fixes the JSON keys using a regex and saves them to JSON format again.
import json
import os
import re
import pandas as pd
#rename to txt files and return list of filenames
def rename_to_text_files():
all_new_filenames =
for filename in os.listdir('./Test'):
if filename.endswith("json"):
new_filename = filename.split('.')[0] + '.txt'
os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
all_new_filenames.append(new_filename)
else:
all_new_filenames.append(filename)
return all_new_filenames
#fix JSON string and save as a JSON file again, returns a list of valid JSON filenames
def fix_dict_rename_to_json_files(files):
json_validated_files =
for index, filename in enumerate(files):
filepath = os.path.join('./Test',filename)
with open(filepath,'r+') as f:
data = f.read()
dict_converted = re.sub("(w+):(.+)", r'"1":2', data)
f.seek(0)
f.write(dict_converted)
f.truncate()
#rename
new_filename = filename[:-4] + '.json'
os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
json_validated_files.append(new_filename)
print("All files converted to valid JSON!")
return json_validated_files
So, now I had two JSON files with valid JSON. But they were still not ready for dataframe conversion. To explain things better, consider the valid JSON from "test.json":
#test.json
{
"participants": [
{
"name": "Test 1"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1485467319139,
"content": "Hie",
"type": "Generic"
}
],
"title": "Test 1",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/xyz"
}
If I read the json into a dataframe, I still get an error because the array lengths are different for each of the keys. You can check this: the "messages" key value is an array of length 1 while "participants" has a value of array length 2:
df = pd.read_json('./Test/test.json')
print(df)
#Error
ValueError: arrays must all be same length
In the next part, we fix this problem by flattening the dict in the JSON.
Part 2: Flatten dict for dataframe conversion:
As you had not specified the shape that you expect for your dataframe, I extracted the values in the best way possible and flattened the dict with the following function. This is assuming the keys provided in the example JSON's will not change across all JSON files:
#accepts a dictionary, flattens as required and returns the dictionary with updated key/value pairs
def flatten(d):
values =
d['participants_name'] = d.pop('participants')
for i in d['participants_name']:
values.append(i['name'])
for i in d['messages']:
d['messages_sender_name'] = i['sender_name']
d['messages_timestamp_ms'] = str(i['timestamp_ms'])
d['messages_content'] = i['content']
d['messages_type'] = i['type']
if "share" in i:
d['messages_share_link'] = i["share"]["link"]
d["is_still_participant"] = str(d["is_still_participant"])
d.pop('messages')
d.update(participants_name=values)
return d
This time let's consider the second example JSON string which also has a "share" key with a URL. The valid JSON string is as below:
#test1.json
{
"participants": [
{
"name": "Clearance"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Emmanuel Sibanda",
"timestamp_ms": 1212242073308,
"content": "Dear",
"share": {
"link": "http://www.example.com/"
},
"type": "Share"
}
],
"title": "Clearance",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clearance"
}
When we flatten this dict with the above function, we get a dict that can be easily fed into a DataFrame
function(discussed later):
with open('./Test/test1.json') as f:
data = json.load(f)
print(flatten(data))
#Output:
{'title': 'Clearance',
'is_still_participant': 'True',
'thread_type': 'Regular',
'thread_path': 'inbox/Clearance',
'participants_name': ['Clearance', 'Person'],
'messages_sender_name': 'Emmanuel Sibanda',
'messages_timestamp_ms': '1212242073308',
'messages_content': 'Dear',
'messages_type': 'Share',
'messages_share_link': 'http://www.example.com/'}
Part 3: Create Dataframes and merge them into one:
Now that we have a function that can flatten the dict, we can call it inside our final function, which will:
- open the JSON files one by one and load each JSON as a dict in memory using json.load()
- call the flatten function on each dict
- convert the flattened dicts to dataframes
- append all the dataframes to an empty list
- merge all the dataframes with pd.concat(), passing the list of dataframes as an argument
The code to accomplish these tasks:
#accepts a list of valid json filenames, creates dataframes from flattened dicts in the JSON files, merges the dataframes and returns the merged dataframe.
def create_merge_dataframes(list_of_valid_json_files):
    df_list = []
    for js in list_of_valid_json_files:
        with open(os.path.join('./Test', js)) as json_file:
            data = json.load(json_file)
            flattened_json_data = flatten(data)
            df = pd.DataFrame(flattened_json_data)
            df_list.append(df)
    merged_df = pd.concat(df_list, sort=False, ignore_index=True)
    return merged_df
Let's give the whole code a test run. We begin with the functions in Part 1 and end with Part 3, to get a merged dataframe.
#rename invalid JSON files to text
files = rename_to_text_files()
#fix JSON strings and save as JSON files again. We pass the "files" variable above as an arg for this function
json_validated_files = fix_dict_rename_to_json_files(files)
#flatten and receive merged dataframes
df = create_merge_dataframes(json_validated_files)
print(df)
The final Dataframe:
title is_still_participant thread_type thread_path
0 Test 1 True Regular inbox/xyz
1 Test 1 True Regular inbox/xyz
2 Clearance True Regular inbox/Clearance
3 Clearance True Regular inbox/Clearance
participants_name messages_sender_name messages_timestamp_ms
0 Test 1 Person 1485467319139
1 Person Person 1485467319139
2 Clearance Emmanuel Sibanda 1212242073308
3 Person Emmanuel Sibanda 1212242073308
messages_content messages_type messages_share_link
0 Hie Generic NaN
1 Hie Generic NaN
2 Dear Share http://www.example.com/
3 Dear Share http://www.example.com/
You can change the order of columns as you like.
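For instance, a reindex with an explicit column list; the names below assume the flatten() output shown earlier:
cols = ['participants_name', 'messages_sender_name', 'messages_timestamp_ms',
        'messages_content', 'messages_type', 'messages_share_link',
        'title', 'is_still_participant', 'thread_type', 'thread_path']
df = df.reindex(columns=cols)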
Note:
- The code does not have exception handling and assumes the keys will be the same across all dicts, as shown in your examples.
- The shape and columns of the dataframes have also been assumed.
- You may add all the functions into one Python script; wherever "./Test" is used as the JSON folder path, substitute your own path. The folder should only contain the malformed JSON files to begin with.
- The whole script can be further modularized by putting the functions into a class.
- It can also be further optimized with hashable data types like tuples, and sped up with the threading and asyncio libraries. However, for a folder of 1000 files this code should work fairly well and shouldn't take very long.
- Some errors may still crop up while converting the malformed JSON files to valid ones, as the contents of all the JSON files are not known.
The code discussed provides a workflow to accomplish what you need and I hope this helps you and anyone who comes across a similar problem.
edited Jan 22 at 11:57
answered Jan 18 at 15:29
amanb
Instead of adding a link to the JSON files, can you edit your question and show example JSON from the files.
– amanb
Jan 14 at 7:44
@amanb added 2 examples of the JSON files
– Emm
Jan 14 at 7:52
These are malformed JSONs. The keys are not strings, and some values are also not strings, which may eventually cause errors. You should find a way to validate them using a JSON validator.
– amanb
Jan 14 at 8:01
Can you share an example output of the dataframe you expect to see? What shape do you expect? Some sort of tabular representation of one of the sample JSONs would be ideal!
– amanb
Jan 14 at 20:01